An HBase table can be thought of as a multidimensional map whose keys are, in order, rowkey, column, and version. The unit of storage is a cell.
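
To make the "multidimensional map" picture concrete, here is a minimal sketch using nested Python dicts. The row, column names, and timestamps are made up for illustration, and `get` is a hypothetical helper, not the HBase client API.

```python
# rowkey -> column ("family:qualifier") -> version -> cell value
table = {
    "row1": {
        "cf:name": {1700000000000: b"alice", 1700000005000: b"alicia"},
        "cf:city": {1700000000000: b"paris"},
    },
}


def get(table, row, column, version=None):
    """Return one cell; when no version is given, return the newest one."""
    versions = table[row][column]
    if version is None:
        version = max(versions)  # highest version number is the newest
    return versions[version]


latest = get(table, "row1", "cf:name")
older = get(table, "row1", "cf:name", 1700000000000)
```

Each lookup walks the three key dimensions in the order named above: rowkey, then column, then version.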

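The one thing to keep firmly in mind is that inserts, updates, and deletes in HBase are all file appends. Each operation writes one entry whose key is {row, column, timestamp} and whose value is the cell plus an operation type. A query has to merge the entries from every HFile and from memory before it can assemble the final data, so the size and number of files affect query performance.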
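
The append-and-merge idea can be sketched as follows. This is a toy model, not HBase code: `Entry`, `read_latest`, and the three lists standing in for two HFiles and the memstore are all assumptions made for illustration, and a delete is assumed to cover everything at or below its timestamp.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    row: str
    column: str
    timestamp: int
    op: str                 # "put" or "delete"
    value: Optional[bytes]


def read_latest(sources, row, column):
    """Merge every source and return the visible value, or None."""
    entries = [e for src in sources for e in src
               if e.row == row and e.column == column]
    if not entries:
        return None
    # Newest timestamp wins; at equal timestamps a delete beats a put.
    newest = max(entries, key=lambda e: (e.timestamp, e.op == "delete"))
    return None if newest.op == "delete" else newest.value


hfile_1 = [Entry("r1", "cf:a", 1, "put", b"v1")]
hfile_2 = [Entry("r1", "cf:a", 2, "put", b"v2")]    # a logical update is just another put
memstore = [Entry("r1", "cf:a", 3, "delete", None)]  # a delete is an appended marker

before_delete = read_latest([hfile_1, hfile_2], "r1", "cf:a")
after_delete = read_latest([hfile_1, hfile_2, memstore], "r1", "cf:a")
```

Nothing is ever modified in place: the read has to look at every source, which is why more and larger files make reads slower.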

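Inserts, updates, and deletes correspond to the type field in the column meta. More precisely, there are only put operations and the delete* family of operations. A logical update is also a put (both in the API and in the oplog), and a delete writes a tombstone record.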

HBase conceptual vs Bigtable

TODO

Namespace

A namespace is a logical grouping of tables analogous to a database in relational database systems.

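A namespace is the unit at which quota management, security, and region server groups are administered independently. A region server group pins a namespace to a set of region servers, which isolates it at the process level.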

Two namespaces exist by default: hbase and default. The former is the system namespace; the latter is the default namespace for user tables.
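
Table names are written as `namespace:table`, and a name without a namespace prefix falls into `default`. The helper below is a hypothetical illustration of that naming rule, not part of any HBase client.

```python
def split_table_name(name: str):
    """Split 'namespace:table' into its parts; bare names use 'default'."""
    namespace, sep, qualifier = name.partition(":")
    if not sep:
        # No colon: the whole string is the table name.
        return "default", name
    return namespace, qualifier


system_table = split_table_name("hbase:meta")
user_table = split_table_name("orders")
scoped_table = split_table_name("my_ns:orders")
```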

Storage order

Logically, the rows of a table are sorted in ascending lexicographic (byte) order of their rowkeys.
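
Byte-wise ordering is not the same as numeric or case-insensitive ordering, which matters when designing rowkeys. A small sketch with made-up rowkeys, using Python's `bytes` comparison as a stand-in for the rowkey comparison:

```python
rowkeys = [b"row2", b"row10", b"row1", b"Row3"]

# bytes objects compare byte by byte, like the rowkey order described above.
ordered = sorted(rowkeys)

# Zero-padding numeric parts makes byte order agree with numeric order.
padded = sorted(b"row%04d" % n for n in (2, 10, 1))
```

Here `b"row10"` sorts before `b"row2"`, and the uppercase `b"Row3"` sorts before every lowercase key. With padding, the keys come back in numeric order.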

Physically, the columns that belong to the same column family of a table are stored together.

TODO: how exactly is the data stored, and how does the logical layout map to the physical one?

delete

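A delete does not remove data right away. It writes a delete marker, and the physical removal happens later, during a major compaction. A delete can target a range of versions. Some hbase-site.xml settings and column family settings also affect how deletes behave.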

Delete markers are purged during the next major compaction of the store, unless the KEEP_DELETED_CELLS option is set in the column family. To keep the deletes for a configurable amount of time, you can set the delete TTL via the hbase.hstore.time.to.purge.deletes property in hbase-site.xml. If hbase.hstore.time.to.purge.deletes is not set, or set to 0, all delete markers, including those with timestamps in the future, are purged during the next major compaction. Otherwise, a delete marker with a timestamp in the future is kept until the major compaction which occurs after the time represented by the marker’s timestamp plus the value of hbase.hstore.time.to.purge.deletes, in milliseconds.
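
The difference between "hidden from reads" and "physically gone" can be sketched like this. It is a simplified model under stated assumptions: entries are plain tuples, a delete marker masks everything at or below its timestamp, and the effects of KEEP_DELETED_CELLS and hbase.hstore.time.to.purge.deletes are ignored. `visible` and `major_compact` are hypothetical names.

```python
def _cutoff(store, row, column):
    """Highest delete-marker timestamp for this column, or -1 if none."""
    return max((ts for (r, c, ts, op, _) in store
                if r == row and c == column and op == "delete"), default=-1)


def visible(store, row, column):
    """Values a read would return, newest first, honouring delete markers."""
    cutoff = _cutoff(store, row, column)
    puts = [(ts, v) for (r, c, ts, op, v) in store
            if r == row and c == column and op == "put" and ts > cutoff]
    return [v for ts, v in sorted(puts, reverse=True)]


def major_compact(store):
    """Rewrite the store without delete markers and the puts they mask."""
    return [e for e in store
            if e[3] == "put" and e[2] > _cutoff(store, e[0], e[1])]


store = [
    ("r1", "cf:a", 1, "put", b"v1"),
    ("r1", "cf:a", 2, "put", b"v2"),
    ("r1", "cf:a", 2, "delete", None),   # marker: delete everything <= 2
    ("r1", "cf:a", 3, "put", b"v3"),
]

seen = visible(store, "r1", "cf:a")
size_before = len(store)
compacted = major_compact(store)
```

Before compaction the read already hides `v1` and `v2`, yet all four entries are still stored. Only the compacted store drops them.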

timestamp and version

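By default HBase uses the timestamp as the version, but this is not required. Users can supply the version themselves: a timestamp in the past, present, or future, or a value that is not a time at all (any long integer).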

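Because HBase uses (row, column, version) as the index into a three-dimensional array, if several writes with the same row and column arrive with the same version (the same instant), only the one processed last survives and the others are silently lost. Timestamps have millisecond resolution, but collisions are still possible.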
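
The silent overwrite follows directly from the key structure. A minimal sketch, with a plain dict standing in for the store and a hypothetical `put` helper:

```python
import time

cells = {}  # (row, column, version) -> value


def put(row, column, value, version=None):
    """Store a value; default the version to the current time in ms."""
    if version is None:
        version = int(time.time() * 1000)
    cells[(row, column, version)] = value  # same key: the old value is replaced
    return version


put("r1", "cf:a", b"first", version=1000)
put("r1", "cf:a", b"second", version=1000)   # same version: overwrites "first"
put("r1", "cf:a", b"third", version=1001)    # new version: kept alongside
```

No error is raised for the second put. The first value is simply gone.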

However, some HBase internals rely on the version timestamp, so it is best to stick with the default.

Caution: the version timestamp is used internally by HBase for things like time-to-live calculations. It’s usually best to avoid setting this timestamp yourself. Prefer using a separate timestamp attribute of the row, or have the timestamp a part of the rowkey, or both.

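Combined with how delete works (it sets a tombstone rather than deleting immediately), not using the default timestamp can cause problems, and even with the default timestamp there is a small chance of trouble: under certain conditions a delete affects puts that come after it.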

Deletes mask puts, even puts that happened after the delete was entered. See HBASE-2256. Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you do a delete of everything <= T. After this you do a new put with a timestamp <= T. This put, even if it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put did have no effect. It will start working again after the major compaction has run. These issues should not be a problem if you use always-increasing versions for new puts to a row. But they can occur even if you do not care about time: just do delete and put immediately after each other, and there is some chance they happen within the same millisecond.
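
The scenario in the quote can be replayed with a toy model of one column. The `Column` class and its methods are hypothetical and only model a "delete everything <= T" tombstone. They are not the HBase API.

```python
class Column:
    """One (row, column) pair with its puts and at most one tombstone."""

    def __init__(self):
        self.puts = {}          # timestamp -> value
        self.tombstone = None   # "delete everything <= this timestamp"

    def put(self, ts, value):
        self.puts[ts] = value   # always succeeds, even if it will be masked

    def delete_up_to(self, ts):
        self.tombstone = ts if self.tombstone is None else max(self.tombstone, ts)

    def get(self):
        live = {ts: v for ts, v in self.puts.items()
                if self.tombstone is None or ts > self.tombstone}
        return live[max(live)] if live else None

    def major_compact(self):
        # Drop masked puts together with the tombstone that masked them.
        if self.tombstone is not None:
            self.puts = {ts: v for ts, v in self.puts.items() if ts > self.tombstone}
            self.tombstone = None


col = Column()
col.put(90, b"old")
col.delete_up_to(100)        # delete everything <= T, with T = 100
col.put(95, b"new")          # issued after the delete, but timestamp <= T
masked = col.get()           # the put "succeeded" yet is not visible

col.major_compact()          # tombstone and masked puts are purged
col.put(95, b"again")        # the same timestamp works again now
after = col.get()
```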

The way delete works can also, under certain conditions, affect a multi-version get:

…create three cell versions at t1, t2 and t3, with a maximum-versions setting of 2. So when getting all versions, only the values at t2 and t3 will be returned. But if you delete the version at t2 or t3, the one at t1 will appear again. Obviously, once a major compaction has run, such behavior will not be the case anymore…

That is, depending on the order of "delete t3" and the major compaction, there are two cases:

  1. put t1-t3, delete t3, major compaction => t1 and t2, or only t2? (TODO: to be verified)
  2. put t1-t3, major compaction, delete t3 => t2
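
The quoted behavior and case 2 can be replayed with a small model. It assumes that a read skips deleted versions first and then keeps the newest two, and that a major compaction with no pending deletes trims the stored versions to the newest two. It deliberately does not model a compaction that runs while a delete marker is pending, so it says nothing about case 1. All names are hypothetical.

```python
MAX_VERSIONS = 2


def get_versions(puts, deleted):
    """Read path: skip deleted versions, then keep the newest MAX_VERSIONS."""
    live = sorted((ts for ts in puts if ts not in deleted), reverse=True)
    return live[:MAX_VERSIONS]


def major_compact(puts):
    """Trim the physically stored versions to the newest MAX_VERSIONS."""
    return set(sorted(puts, reverse=True)[:MAX_VERSIONS])


puts = {1, 2, 3}                              # versions t1, t2, t3

before = get_versions(puts, set())            # plain read, no deletes
resurfaced = get_versions(puts, {3})          # delete t3 before any compaction

compacted = major_compact(puts)               # case 2: compaction first
case2 = get_versions(compacted, {3})          # then delete t3
```

Before compaction, t1 is still on disk, so deleting t3 lets it resurface. After compaction, t1 is gone for good and only t2 remains visible.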
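
The version/timestamp in the key is also used for conflict detection when replicating across data centers, so defining your own versions can cause problems. There is one special case, though: if you build secondary indexes across multiple tables, you want to read consistent data. In that case it is better to set the version yourself, so that a read can use the version to fetch a specific cell.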

GC

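One version-related aspect of GC: adding a new cell can push the oldest version past the maxVersions setting, but the actual removal still happens during major compaction. Combined with the delete behavior described above, all kinds of confusing results become possible.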

The other GC trigger is TTL (time-to-live) expiry. Expired cells are also cleaned up during major compaction. Once every cell in a row has expired, the row no longer exists.
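
A rough sketch of TTL expiry and the "row disappears" consequence. It assumes expiry is computed from the cell's version timestamp and uses a made-up TTL in milliseconds. It models what is visible, not when the bytes are physically removed. `live_cells` and `row_exists` are hypothetical helpers.

```python
def live_cells(row_cells, now_ms, ttl_ms):
    """Keep only the cells whose age is still below the TTL."""
    return {col: (ts, value) for col, (ts, value) in row_cells.items()
            if now_ms - ts < ttl_ms}


def row_exists(row_cells, now_ms, ttl_ms):
    """A row exists only while at least one of its cells is alive."""
    return bool(live_cells(row_cells, now_ms, ttl_ms))


row = {"cf:a": (1000, b"x"), "cf:b": (5000, b"y")}   # column -> (timestamp, value)
TTL_MS = 3000

at_6000 = live_cells(row, now_ms=6000, ttl_ms=TTL_MS)
exists_at_6000 = row_exists(row, now_ms=6000, ttl_ms=TTL_MS)
exists_at_9000 = row_exists(row, now_ms=9000, ttl_ms=TTL_MS)
```

Because the age is derived from the cell's timestamp, a user-supplied timestamp changes when the cell expires, which is one reason the caution above recommends leaving it alone.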

Dynamic columns

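When creating a table you only declare the column families. Columns are added dynamically at put time, which means HBase cannot tell us which columns exist. The application has to maintain, interpret, and use that information itself.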
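
One way an application might keep track of its own columns is sketched below. The `Table` class is a toy stand-in (a `KeyError` plays the role of the error for an undeclared column family), and `known_columns` and `tracked_put` are hypothetical names for bookkeeping that lives entirely on the application side.

```python
class Table:
    """Toy table: families are fixed at creation, qualifiers are free-form."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row, {})[f"{family}:{qualifier}"] = value


known_columns = set()   # maintained by the application, not by the store


def tracked_put(table, row, family, qualifier, value):
    """Write the cell, then record the column name for later discovery."""
    table.put(row, family, qualifier, value)
    known_columns.add(f"{family}:{qualifier}")


table = Table(families=["cf"])
tracked_put(table, "r1", "cf", "a", b"1")
tracked_put(table, "r2", "cf", "b", b"2")   # a brand-new column, no schema change

try:
    tracked_put(table, "r1", "other", "x", b"3")
    rejected = False
except KeyError:
    rejected = True                          # families must be declared up front
```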

Impact on ACID

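Atomicity: single-row operations are atomic, including a write to one row that spans column families. Atomicity across rows is not supported, so in a batch write some rows may succeed while others fail or time out. A write that timed out may have succeeded or may have failed.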
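
The practical consequence is that the caller of a batch write has to inspect a per-row outcome. The sketch below is a toy model: `batch_put` and its `fail_rows` parameter (used to simulate failures) are hypothetical, and a plain dict stands in for the table.

```python
def batch_put(table, mutations, fail_rows=()):
    """Apply each row's mutation independently; return {row: succeeded}."""
    results = {}
    for row, columns in mutations.items():
        if row in fail_rows:
            results[row] = False      # nothing from this row is applied
            continue
        staged = dict(table.get(row, {}))
        staged.update(columns)        # columns from several families together
        table[row] = staged           # the whole row becomes visible at once
        results[row] = True
    return results


table = {}
mutations = {
    "r1": {"cf1:a": b"1", "cf2:b": b"2"},   # one row, two column families
    "r2": {"cf1:a": b"3"},
}
results = batch_put(table, mutations, fail_rows={"r2"})
failed = [row for row, ok in results.items() if not ok]
```

Row `r1` is applied in full across both families while `r2` is not applied at all. The batch as a whole is neither committed nor rolled back.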

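Consistency & Isolation: a single-row get is consistent. A scan only guarantees "read committed" consistency: it may return rows from different points in time, but each individual row is still consistent. Note also that a multi-row scan depends on transaction commit time (when the operation was performed), not on the version/timestamp. As a result, if a scanner is constructed at time t and a cell was committed before t with a version timestamp later than t, that value may still be read.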

Please note that the guarantees listed above regarding scanner consistency are referring to “transaction commit time”, not the “timestamp” field of each cell. That is to say, a scanner started at time t may see edits with a timestamp value greater than t, if those edits were committed with a “forward dated” timestamp before the scanner was constructed.
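
The distinction between commit time and cell timestamp can be sketched as follows. The model is an assumption made for illustration: each edit carries a commit sequence number, and a scanner sees edits by commit order regardless of their timestamp. Whether edits committed after the scanner was constructed are also seen is not modelled here.

```python
import itertools

_commit_seq = itertools.count(1)
edits = []


def put(row, value, timestamp):
    """Commit an edit; the sequence number reflects when it was committed."""
    edits.append({"seq": next(_commit_seq), "row": row,
                  "value": value, "ts": timestamp})


put("r1", b"a", timestamp=100)
put("r2", b"b", timestamp=10_000)      # forward-dated, but already committed

scanner_time = 200                     # scanner constructed at t = 200
read_point = edits[-1]["seq"]          # everything committed so far

# What the scanner goes by: commit order, not the timestamp field.
seen_by_commit = [e["row"] for e in edits if e["seq"] <= read_point]

# What one might wrongly expect: filtering by the cell timestamp.
seen_by_timestamp = [e["row"] for e in edits if e["ts"] <= scanner_time]
```

The forward-dated edit on `r2` is visible to the scanner even though its timestamp is far greater than t.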

Visibility: operations on a single row are ordered. For any one row, the results of successive gets always follow the order of the puts.

Durability: any data that is visible (through get or scan) has been persisted and will not be lost.

References

http://hbase.apache.org/book/datamodel.html

http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf

https://issues.apache.org/jira/browse/HBASE-2406

Recommended: http://www.ngdata.com/bending-time-in-hbase/
