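While running an experiment I compared the query speed of HBase and PostgreSQL, and found that a query taking about 300 ms in PostgreSQL took about 5 seconds in HBase. That gap is far too large! I had already ruled out overhead such as establishing the database connection, so the cause had to be tracked down. The original HBase code is below. It sets up two prefix-based row filters (RowFilter with a BinaryPrefixComparator) to find the data for one mac between beginTime and endTime: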

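import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryPrefixComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.util.Bytes;
// Both filters must pass: the row key has to fall between beginTime and endTime.
FilterList fl = new FilterList(FilterList.Operator.MUST_PASS_ALL);
// Lower bound: row key prefix >= mac + beginTimeStamp
BinaryPrefixComparator comp = new BinaryPrefixComparator(Bytes.toBytes(String.format("%s%d", mac, beginTimeStamp)));
RowFilter filter = new RowFilter(CompareFilter.CompareOp.GREATER_OR_EQUAL, comp);
// Upper bound: row key prefix <= mac + endTimeStamp
BinaryPrefixComparator comp2 = new BinaryPrefixComparator(Bytes.toBytes(String.format("%s%d", mac, endTimeStamp)));
RowFilter filter2 = new RowFilter(CompareFilter.CompareOp.LESS_OR_EQUAL, comp2);
fl.addFilter(filter);
fl.addFilter(filter2);
Scan scan = new Scan();
scan.setFilter(fl);
//.... other code here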

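Aren't row keys sorted lexicographically and indexed? Finding the rows bracketed by two prefix bounds should be fast, otherwise what is the row key index for? To find the cause step by step, I first needed to understand what a Filter actually does. After some searching I found an answer on Stack Overflow [1] saying that filters are very slow because they require a full table scan, and that a fast query needs STARTROW and ENDROW to be set. The original text: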

HBase filters - even row filters - are really slow, since in most cases these do a complete table scan, and then filter on those results. Have a look at this discussion: http://grokbase.com/p/hbase/user/115cg0d7jh/very-slow-scan-performance-using-filters

Row key range scans however, are indeed much faster - they do the equivalent of a filtered table scan. This is because the row keys are stored in sorted order (this is one of the basic guarantees of HBase, which is a BigTable-like solution), so the range scans on row keys are very fast. More explanation here: http://www.quora.com/How-feasible-is-real-time-querying-on-HBase-Can-it-be-achieved-through-a-programming-language-such-as-Python-PHP-or-JSP

UPDATE: turns out that PrefixFilter does do a full table scan until it passes the prefix used in the filter (if it finds it). The recommendation for fast performance using a PrefixFilter seems to be to specify a start_row parameter in addition to the PrefixFilter. See related 2013 discussion on the hbase-user mailing list.

I also found a more explicit explanation [2]:

Filters push row selection criteria out to the HBase region servers for processing so that rows can be filtered remotely and in parallel (when more than one region server is involved). Using these functions helps you to avoid sending rows to the client that are not needed.

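So my understanding was off. A Filter cannot be used to seek directly to the data being queried. As the description shows, with a Filter alone the whole table is still scanned; the only difference is that just the rows passing the Filter are sent to the client. The HBase Java API doc [3] says the same: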

If an already known row range needs to be scanned, use CellScanner start and stop rows directly rather than a filter.
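To make the difference concrete, here is a minimal sketch in plain Java that does not use HBase at all. It assumes a `TreeMap` as a stand-in for a table with sorted row keys; the class name, the `aa`/`bb`/`cc` keys, and the `rowsVisited` counter are all hypothetical and exist only for illustration. It compares a filter-only pass, which reads every row, with a range lookup, which seeks to the lower bound and stops at the upper bound.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class ScanVsFilterSketch {
    // Number of rows read by the most recent query (illustration only).
    static int rowsVisited = 0;

    // Hypothetical table: 3 macs x 1000 timestamps, keys kept in sorted order.
    static TreeMap<String, String> buildTable() {
        TreeMap<String, String> table = new TreeMap<>();
        for (String mac : new String[] {"aa", "bb", "cc"}) {
            for (int t = 1000; t < 2000; t++) {
                table.put(mac + t, "value");
            }
        }
        return table;
    }

    // Filter-only query: every row is read first, then tested against the bounds.
    static List<String> filterOnly(TreeMap<String, String> table, String lo, String hi) {
        rowsVisited = 0;
        List<String> hits = new ArrayList<>();
        for (String key : table.keySet()) {
            rowsVisited++;
            if (key.compareTo(lo) >= 0 && key.compareTo(hi) <= 0) {
                hits.add(key);
            }
        }
        return hits;
    }

    // Range query: only the keys between lo and hi (both inclusive here) are read.
    static List<String> rangeScan(TreeMap<String, String> table, String lo, String hi) {
        rowsVisited = 0;
        List<String> hits = new ArrayList<>();
        for (String key : table.subMap(lo, true, hi, true).keySet()) {
            rowsVisited++;
            hits.add(key);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = buildTable();
        List<String> a = filterOnly(table, "bb1100", "bb1199");
        System.out.println("filter only: " + a.size() + " hits, " + rowsVisited + " rows read");
        List<String> b = rangeScan(table, "bb1100", "bb1199");
        System.out.println("range scan:  " + b.size() + " hits, " + rowsVisited + " rows read");
    }
}
```

Both queries return the same 100 rows, but the filter-only version reads all 3000 rows while the range version reads only the 100 it returns. This is only an analogy for the scan -> filter order described in the quotes above, not a model of HBase internals.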

With the cause found, I followed the advice and added a start row key and a stop row key. Set the row keys according to your own needs!

//....
scan.setStartRow(Bytes.toBytes(String.format("%s%d0000", mac, beginTimeStamp)));
scan.setStopRow(Bytes.toBytes(String.format("%s%d0300", mac, endTimeStamp)));
//...
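The following plain-Java sketch shows how such a pair of bounds selects rows. It assumes the usual HBase semantics: row keys are compared byte by byte as unsigned values, the start row is inclusive, and the stop row is exclusive. The class name, the helper names, and the sample mac and timestamps are hypothetical; the `0000` and `0300` suffixes are copied from the snippet above, and what they mean depends on the row key design, which the article does not spell out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RowRangeSketch {
    // Builds a row key as mac + timestamp + suffix, mirroring the String.format call above.
    static byte[] key(String mac, long timestamp, String suffix) {
        return String.format("%s%d%s", mac, timestamp, suffix).getBytes(StandardCharsets.UTF_8);
    }

    // True if row lies in [startRow, stopRow) under unsigned lexicographic byte order.
    // Arrays.compareUnsigned requires Java 9 or later.
    static boolean inRange(byte[] row, byte[] startRow, byte[] stopRow) {
        return Arrays.compareUnsigned(row, startRow) >= 0
            && Arrays.compareUnsigned(row, stopRow) < 0;
    }

    public static void main(String[] args) {
        String mac = "aabbcc";                        // hypothetical mac
        byte[] start = key(mac, 1400000000L, "0000"); // hypothetical begin timestamp
        byte[] stop = key(mac, 1400003600L, "0300");  // hypothetical end timestamp

        System.out.println(inRange(key(mac, 1400001800L, "0001"), start, stop));      // inside the range
        System.out.println(inRange(stop, start, stop));                               // stop row itself is excluded
        System.out.println(inRange(key("aabbcd", 1400001800L, "0001"), start, stop)); // different mac
    }
}
```

One thing to watch with `%d`: timestamps with different digit counts do not sort numerically as strings (`"999"` sorts after `"1000"`), so this scheme relies on all timestamps having the same number of digits.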

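After this change, the same query dropped to the order of milliseconds. There is also a wrapper class, WhileMatchFilter, which ends the scan automatically; use it if you need it. The example in [4] shows how it is used.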
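The idea behind ending a scan early can be sketched without HBase. The example below again assumes a `TreeMap` as a stand-in for a sorted table; the class, method, and key names are hypothetical. It starts at a given row and compares a plain prefix test, which keeps reading to the end of the table, with a "while match" variant that stops at the first row failing the test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class WhileMatchSketch {
    // Number of rows read by the most recent scan (illustration only).
    static int rowsVisited = 0;

    // Hypothetical table: 3 macs x 1000 timestamps, keys kept in sorted order.
    static TreeMap<String, String> buildTable() {
        TreeMap<String, String> table = new TreeMap<>();
        for (String mac : new String[] {"aa", "bb", "cc"}) {
            for (int t = 1000; t < 2000; t++) {
                table.put(mac + t, "value");
            }
        }
        return table;
    }

    // Starts at startRow and tests every remaining row against the prefix.
    static List<String> scanWithPlainFilter(TreeMap<String, String> table, String startRow, String prefix) {
        rowsVisited = 0;
        List<String> hits = new ArrayList<>();
        for (String key : table.tailMap(startRow, true).keySet()) {
            rowsVisited++;
            if (key.startsWith(prefix)) {
                hits.add(key);
            }
        }
        return hits;
    }

    // Starts at startRow and ends the scan at the first row that fails the prefix test.
    static List<String> scanWhileMatch(TreeMap<String, String> table, String startRow, String prefix) {
        rowsVisited = 0;
        List<String> hits = new ArrayList<>();
        for (String key : table.tailMap(startRow, true).keySet()) {
            rowsVisited++;
            if (!key.startsWith(prefix)) {
                break; // the sorted order guarantees no later key can match
            }
            hits.add(key);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = buildTable();
        List<String> plain = scanWithPlainFilter(table, "bb", "bb");
        System.out.println("plain filter: " + plain.size() + " hits, " + rowsVisited + " rows read");
        List<String> early = scanWhileMatch(table, "bb", "bb");
        System.out.println("while match:  " + early.size() + " hits, " + rowsVisited + " rows read");
    }
}
```

Both variants return the 1000 `bb` rows. The plain version reads 2000 rows (everything from `bb1000` to the end of the table), while the early-exit version reads 1001 (the matches plus the first non-matching row). Early exit is only safe here because the keys are sorted and the test is a prefix test.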

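Summary: the biggest misconception was misunderstanding what a Filter does. I assumed filtering came first and scanning second, but the actual order is scan -> filter. So to take advantage of the row key index itself, apart from a get on a specific row key, a scan must specify a start row key and a stop row key. Otherwise every query is a full table scan, whether the row filter is regex-based, a PrefixFilter, or anything else!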

References:

  1. Should I use prefixfilter or rowkey range scan in hbase
  2. IBM Knowledge Center - HBase Module
  3. Class RowFilter
  4. Very slow Scan performance using Filters
  5. [HBase-user] Why RowFilter plus BinaryPrefixComparator solution is so slow