上篇我们提到了HBase建立二级索引的原理,但是当我们自己建立二级索引,就需要做额外很多工作,我们需要定义Coprocessor,在数据更新的时候更新索引,查询时先去检索索引表等,定义这些Coprocessor不仅费时费力,而且一般使用者不能保证程序的高效性、健壮性等。还好,已经有不少构建在HBase上的插件——个人觉得称之为插件更为合理,已经帮我们做了这些工作,Phoenix就是其中一种。

Phoenix提供的二级索引按索引对象可以分为Covered Indexes 和 Functional Indexes,Covered Indexed就是对原始数据进行所以,Functional Indexes就是就原始数据进行处理后再索引,区别从官方给出的两个例子就可以很明显的看出,不再讨论。

1
2
3
4
5
6
# Covered Indexes
CREATE INDEX my_index ON my_table (v1,v2) INCLUDE(v3)

# Functional Indexes
CREATE INDEX UPPER_NAME_IDX ON EMP (UPPER(FIRST_NAME||' '||LAST_NAME))
SELECT EMP_ID FROM EMP WHERE UPPER(FIRST_NAME||' '||LAST_NAME)='JOHN DOE'

另外Phoenix的索引分为Global IndexesLocal Indexes,官方文档提到,Global Index适合适用于读多写少的场景,Local Index适合频繁写的场景,那么这两种索引到底有什么区别呢?

1 建立数据表,明确需求

因为是课题需要,已经存在一张数据表BIGJOY.IMOS,大概有400万行的数据,最初就是使用Phoenix建立的,给大家看一下该表的结构,注意该cell不是HBase的Cell,不要混淆。

COLUMN_NAME DATA_TYPE TYPE_NAME
ID 1 CHAR
FLOOR 1 CHAR
TIME 93 TIMESTAMP
MAC 1 CHAR
X 4 INTEGER
Y 4 INTEGER
CELL 1 CHAR

需求:1.根据mac和time约束查询cell;2.根据cell和time约束查询mac。以需求1为例,探索Phoenix Secondary Index的工作原理。

2 对比案例

为了方便和严谨,本文建立索引前后进行了下面的八个查询,也是从这八个查询探究Phoenix Secondary Index,下面就先列出这8个查询语句。

编号 SQl 查询结果
1 select cell from bigjoy.imos where mac = '28E14CA61635'; 28条记录
2 select time from bigjoy.imos where mac = '28E14CA61635'; 28条记录
3 select floor from bigjoy.imos where mac = '28E14CA61635'; 28条记录
4 select * from bigjoy.imos where mac = '28E14CA61635'; 28条记录
5 select mac from bigjoy.imos where time = to_date('2014-04-01 20:39:42.000'); 223条记录
6 select cell from bigjoy.imos where time = to_date('2014-04-01 20:39:42.000'); 223条记录
7 select cell from bigjoy.imos where mac = '28E14CA61635' and time = to_date('2014-04-01 20:39:42.000'); 1条记录
8 select * from bigjoy.imos where mac = '28E14CA61635' and time = to_date('2014-04-01 20:39:42.000'); 1条记录

3 建立mac-time全局索引

下面建立针对mac的索引

1
2
CREATE INDEX bigjoy.mac_idx on bigjoy.imos (mac,time) include (cell)
# global index default.

4 效率对比

针对2中的8条SQL,再次进行查询,与建立索引前的对比如下表:

编号 建立索引前(秒) 建立索引后(秒)
1 7.236 0.175
2 6.553 0.029
3 7.581 6.826
4 7.171 6.616
5 7.129 9.926
6 7.258 7.039
7 7.632 0.026
8 7.682 7.383

可以看出,建立索引是选择的是(mac,time)字段,并且包含了cell字段,以mac为约束查询time或cell字段效率提升很大,但当查询floor字段或者包含其他字段的时候,索引的存在并没有提升效率,以time为约束的查询在索引建立前后也没有效率上的改变。

下面就深入分析一下原因。

4.1 查看一下数据表有木有变化

索引建立好后,先使用Phoenix的 !tables 命令查看一下所用的数据表,从结果中可以看出多出了一个BIGJOY.MAC_IDX的数据表,如图所示。

使用Phoenix建立索引后多出一个数据表

4.2 探索索引表的结构

首先看一下该表都有哪些字段呀!

1
2
3
4
5
6
7
8
9
0: jdbc:phoenix:localhost> !describe bigjoy.mac_idx
+------------+--------------+-------------+--------------+------------+------------+--------------+----------------+-----------------+-----------------+-----+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | COLUMN_NAME | DATA_TYPE | TYPE_NAME | COLUMN_SIZE | BUFFER_LENGTH | DECIMAL_DIGITS | NUM_PREC_RADIX | NUL |
+------------+--------------+-------------+--------------+------------+------------+--------------+----------------+-----------------+-----------------+-----+
| | BIGJOY | MAC_IDX | 0:MAC | 12 | VARCHAR | 12 | null | null | null | 1 |
| | BIGJOY | MAC_IDX | 0:TIME | 3 | DECIMAL | null | null | null | null | 1 |
| | BIGJOY | MAC_IDX | :ID | 1 | CHAR | 26 | null | null | null | 0 |
| | BIGJOY | MAC_IDX | 0:CELL | 1 | CHAR | 4 | null | null | null | 1 |
+------------+--------------+-------------+--------------+------------+------------+--------------+----------------+-----------------+-----------------+-----+

注意,Phoenix把原来的TimeStamp类型的时间变成了Decimal,应该是用整数的时间戳来表示时间了,这样运算起来简单!

接下来看一下该表的主键构成!primarykeys bigjoy.mac_idx,结果如下。

1
2
3
4
5
6
7
+------------+--------------+-------------+--------------+----------+----------+--------------+------------+------------+--------------+-+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | COLUMN_NAME | KEY_SEQ | PK_NAME | ASC_OR_DESC | DATA_TYPE | TYPE_NAME | COLUMN_SIZE | |
+------------+--------------+-------------+--------------+----------+----------+--------------+------------+------------+--------------+-+
| | BIGJOY | MAC_IDX | 0:MAC | 1 | | A | 12 | VARCHAR | 12 | |
| | BIGJOY | MAC_IDX | 0:TIME | 2 | | A | 3 | DECIMAL | null | |
| | BIGJOY | MAC_IDX | :ID | 3 | | A | 1 | CHAR | 26 | |
+------------+--------------+-------------+--------------+----------+----------+--------------+------------+------------+--------------+-+

可以看出,该表的主键是由0:MAC,0:TIME,:ID按顺序构成的,不过:ID是什么玩意儿,是原数据表的主键?为了验证猜想,请接着看(。・ω・)

1
2
3
4
5
6
7
0: jdbc:phoenix:localhost> select * from bigjoy.mac_idx limit 1;
+---------------+------------------+-----------------------------+---------+
| 0:MAC | 0:TIME | :ID | 0:CELL |
+---------------+------------------+-----------------------------+---------+
| 000000000C00 | 1.396363024E+12 | 02281396334224000000000C00 | 228 |
+---------------+------------------+-----------------------------+---------+
1 row selected (0.123 seconds)

这里也可以看出该索引表(是真的能被用户看到的表哈)包含了4个字段,前三个刚才讲了,是构成主键的组成部分,0:CELL很好理解,就是创建索引时include 后边的cell字段了。有:ID了不是,好的,去hbase shell中利用:ID查询一下数据!

1
2
3
4
5
6
7
8
9
10
hbase(main):038:0> get 'BIGJOY.IMOS','02281396334224000000000C00'
COLUMN CELL
0:CELL timestamp=1470667019714, value=228
0:FLOOR timestamp=1470667019714, value=20010
0:MAC timestamp=1470667019714, value=000000000C00
0:TIME timestamp=1470667019714, value=\x80\x00\x01E\x1D\xB8\xD6\x80\x00\x00\x00\x00
0:X timestamp=1470667019714, value=\x80\x01\x85\x1D
0:Y timestamp=1470667019714, value=\x80\x01\x82\xE8
0:_0 timestamp=1470667019714, value=x
7 row(s) in 0.0260 seconds

嗯,果不其然呀, :ID在索引表中参与构成主键,首先是保证了索引表主键的唯一性,因此Global Index是密集索引,另外,它对查询也是有帮助的,埋个伏笔,下面会提到!

4.3 查询计划解读

4.3.1 Phoenix提供了Explain命令,解释执行计划,下面先看mac约束的对比吧,咣当咣当真是够长的( ̄ˇ ̄)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35

0: jdbc:phoenix:localhost> explain select cell from bigjoy.imos where mac = '28E14CA61635';
+-----------------------------------------------------------------------------------------------------------+
| PLAN |
+-----------------------------------------------------------------------------------------------------------+
| CLIENT 1-CHUNK 0 ROWS 0 BYTES PARALLEL 1-WAY ROUND ROBIN RANGE SCAN OVER BIGJOY.MAC_IDX ['28E14CA61635'] |
+-----------------------------------------------------------------------------------------------------------+
1 row selected (0.584 seconds)

0: jdbc:phoenix:localhost> explain select time from bigjoy.imos where mac = '28E14CA61635';
+-----------------------------------------------------------------------------------------------------------+
| PLAN |
+-----------------------------------------------------------------------------------------------------------+
| CLIENT 1-CHUNK 0 ROWS 0 BYTES PARALLEL 1-WAY ROUND ROBIN RANGE SCAN OVER BIGJOY.MAC_IDX ['28E14CA61635'] |
| SERVER FILTER BY FIRST KEY ONLY |
+-----------------------------------------------------------------------------------------------------------+
2 rows selected (0.038 seconds)

0: jdbc:phoenix:localhost> explain select * from bigjoy.imos where mac = '28E14CA61635';
+------------------------------------------------------------------------------------------------------+
| PLAN |
+------------------------------------------------------------------------------------------------------+
| CLIENT 6-CHUNK 22526956 ROWS 1258291413 BYTES PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER BIGJOY.IMOS |
| SERVER FILTER BY MAC = '28E14CA61635' |
+------------------------------------------------------------------------------------------------------+
2 rows selected (0.038 seconds)

0: jdbc:phoenix:localhost> explain select * from bigjoy.imos where mac = '28E14CA61635';
+------------------------------------------------------------------------------------------------------+
| PLAN |
+------------------------------------------------------------------------------------------------------+
| CLIENT 6-CHUNK 22526956 ROWS 1258291413 BYTES PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER BIGJOY.IMOS |
| SERVER FILTER BY MAC = '28E14CA61635' |
+------------------------------------------------------------------------------------------------------+
2 rows selected (0.036 seconds)

看到这里就明白了吧,当查询cell字段的时候,是直接去索引表了提取出了cell的值;但当包含了创建索引表时没有提供的字段的时候,Phoenix就是回到原数据表进行全表扫描了,不过是最终在服务器端进行了Filter过滤而已,关于为什么使用了Filter还是进行的全表扫描哦,可以参见HBase RowFilter 的使用误区及如何加快 HBase 查询速度

4.3.2 那么接下来看一下time约束的索引查询计划!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

0: jdbc:phoenix:localhost> explain select cell from bigjoy.imos where time = to_date('2014-04-01 20:39:42.000');
+-----------------------------------------------------------------------------------------+
| PLAN |
+-----------------------------------------------------------------------------------------+
| CLIENT 4-CHUNK 0 ROWS 0 BYTES PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER BIGJOY.MAC_IDX |
| SERVER FILTER BY TO_TIMESTAMP("TIME") = TIMESTAMP '2014-04-01 20:39:42.000' |
+-----------------------------------------------------------------------------------------+
2 rows selected (0.062 seconds)

0: jdbc:phoenix:localhost> explain select * from bigjoy.imos where time = to_date('2014-04-01 20:39:42.000');
+------------------------------------------------------------------------------------------------------+
| PLAN |
+------------------------------------------------------------------------------------------------------+
| CLIENT 6-CHUNK 22526956 ROWS 1258291413 BYTES PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER BIGJOY.IMOS |
| SERVER FILTER BY TIME = TIMESTAMP '2014-04-01 20:39:42.000' |
+------------------------------------------------------------------------------------------------------+
2 rows selected (0.039 seconds)

可以看出,仅有time约束的时候,不管进行的是cell查询还是包含其他字段的查询,都是进行全表扫描,不过一个是扫描索引表,一个是原数据表罢了!所以效率没有提升

4.3.3 最后,同时进行mac和time约束查询,从时间对比上就可以猜想是进行了索引表的查询,当然,查询目标不能包含创建索引表时未包含的字段,事实也是如此。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

0: jdbc:phoenix:localhost> explain select cell from bigjoy.imos where mac = '28E14CA61635' and time = to_date('2014-04-01 20:39:42.000');
+-----------------------------------------------------------------------------------------------------------------------------+
| PLAN |
+-----------------------------------------------------------------------------------------------------------------------------+
| CLIENT 1-CHUNK 0 ROWS 0 BYTES PARALLEL 1-WAY ROUND ROBIN RANGE SCAN OVER BIGJOY.MAC_IDX ['28E14CA61635',1,396,384,782,000] |
+-----------------------------------------------------------------------------------------------------------------------------+
1 row selected (0.049 seconds)

0: jdbc:phoenix:localhost> explain select * from bigjoy.imos where mac = '28E14CA61635' and time = to_date('2014-04-01 20:39:42.000');
+------------------------------------------------------------------------------------------------------+
| PLAN |
+------------------------------------------------------------------------------------------------------+
| CLIENT 6-CHUNK 22526956 ROWS 1258291413 BYTES PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER BIGJOY.IMOS |
| SERVER FILTER BY (MAC = '28E14CA61635' AND TIME = TIMESTAMP '2014-04-01 20:39:42.000') |
+------------------------------------------------------------------------------------------------------+
2 rows selected (0.034 seconds)

4.3.4 还有,看一下官方给出的Index Hint方法,可以看出,:ID参与索引表的构成,也是为了实现这一个目的,先查索引表,在根据ID查原表!

1
2
3
4
5
6
7
8
9
10
11
0: jdbc:phoenix:localhost> explain select /*+ INDEX(bigjoy.imos mac_idx) */ * from bigjoy.imos where mac = '28E14CA61635';
+-------------------------------------------------------------------------------------------------------------------+
| PLAN |
+-------------------------------------------------------------------------------------------------------------------+
| CLIENT 6-CHUNK 22526956 ROWS 1258291413 BYTES PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER BIGJOY.IMOS |
| SKIP-SCAN-JOIN TABLE 0 |
| CLIENT 1-CHUNK 0 ROWS 0 BYTES PARALLEL 1-WAY ROUND ROBIN RANGE SCAN OVER BIGJOY.MAC_IDX ['28E14CA61635'] |
| SERVER FILTER BY FIRST KEY ONLY |
| DYNAMIC SERVER FILTER BY "BIGJOY.IMOS.ID" IN ($25.$27) |
+-------------------------------------------------------------------------------------------------------------------+
5 rows selected (0.041 seconds)

4.4 什么时候索引发挥作用,什么时候索引不发挥作用(都是指全局索引)?

通过以上几个例子对比和查询计划分析,并且结合索引表的主键构成(谁让HBase有且只有主键的字典序索引呢),查询字段只包括创建索引时的字段并且约束条件中包括构成索引表主键的第一个字段时,索引表时发挥作用的,其他情况下全局索引不能加快查询速度,这也和官方文档给出的解释是一样样的!

文中使用的例子只包括两个字段约束,如果有三个字段呢,结果又如何?如:(field1, field2, field3),可以猜想使用field1, field1 + field2, field1 + field3, field1 + field2 + field3进行约束查询,该索引表才会发挥作用!感兴趣的可以去尝试一下!

为什么是这样?还是因为HBase只有字典序的主键索引啊,如果不使用field1进行约束,那么将不能确定索引表中行健的扫描范围,所以仍然要进行全局扫描(索引表或者原数据表),同理,如果查询结果一下,使用field1 + field2 约束理论上是要优于 field1 + field3的,不过既然查询结果都一样了,那么field2 和 field3应该是没有约束作用的,除非field3 == field3。感兴趣的也可以尝试一下!

5总结

  • Global Index 是密集索引;
  • 要想索引发挥作用,必须包含创建索引是的第一个字段进行约束;
  • 索引字段从前到后,约束越多,查询效率越高(中间不能跳过索引字段),这样可以快速缩小扫描范围;
  • 查询一定不能有索引字段之外的其他字段,不然得去扫描原表;
  • 不满足上一条结论的要求时,官方文档提供的Index Hint方法,是根据索引表得到原表主键,再去原表中根据主键查询其他字段!

Local Index 就留着下一次写吧!