Translation: SST File Formats, RocksDB BlockBasedTable Format

Source: https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format

Rocksdb BlockBasedTable Format

This page is forked from LevelDB's document on table format and reflects the changes we have made during the development of RocksDB.

BlockBasedTable is the default SST table format in RocksDB.

File format

<beginning_of_file>
[data block 1]
[data block 2]
...
[data block N]
[meta block 1: filter block]                  (see section: "filter" Meta Block)
[meta block 2: index block]
[meta block 3: compression dictionary block]  (see section: "compression dictionary" Meta Block)
[meta block 4: range deletion block]          (see section: "range deletion" Meta Block)
[meta block 5: stats block]                   (see section: "properties" Meta Block)
...
[meta block K: future extended block]  (we may add more meta blocks in the future)
[metaindex block]
[Footer]                               (fixed size; starts at file_size - sizeof(Footer))
<end_of_file>

The file contains internal pointers, called BlockHandles, each containing the following information:

offset:         varint64
size:           varint64

See this document for an explanation of the varint64 format.
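As an illustration of the BlockHandle layout above, here is a minimal Python sketch of encoding and decoding the two varint64 fields. It assumes the usual little-endian base-128 varint scheme (7 payload bits per byte, high bit set on every byte except the last). The function names are hypothetical and are not RocksDB APIs.

```python
from typing import Tuple


def encode_varint64(value: int) -> bytes:
    """Encode an integer in [0, 2**64) as a base-128 varint."""
    if not 0 <= value < 1 << 64:
        raise ValueError("value out of range for varint64")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)  # final byte, continuation bit clear
    return bytes(out)


def decode_varint64(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf) or shift > 63:
            raise ValueError("truncated or overlong varint64")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def encode_block_handle(offset: int, size: int) -> bytes:
    """A BlockHandle is the offset followed by the size, both varint64."""
    return encode_varint64(offset) + encode_varint64(size)


def decode_block_handle(buf: bytes, pos: int = 0) -> Tuple[int, int, int]:
    """Return (offset, size, next_pos)."""
    offset, pos = decode_varint64(buf, pos)
    size, pos = decode_varint64(buf, pos)
    return offset, size, pos
```

Under this scheme a varint64 occupies at most 10 bytes, so an encoded BlockHandle occupies at most 20 bytes.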

(1) The sequence of key/value pairs in the file is stored in sorted order and partitioned into a sequence of data blocks. These blocks come one after another at the beginning of the file. Each data block is formatted according to the code in block_builder.cc (see the code comments in that file) and then optionally compressed.

(2) After the data blocks, we store a number of meta blocks. The supported meta block types are described below. More meta block types may be added in the future. Each meta block is again formatted using block_builder.cc and then optionally compressed.

(3) A metaindex block contains one entry for every meta block, where the key is the name of the meta block and the value is a BlockHandle pointing to that meta block.

(4) At the very end of the file is a fixed-length footer that contains the BlockHandles of the metaindex and index blocks as well as a magic number.

   metaindex_handle: char[p];      // Block handle for metaindex
   index_handle:     char[q];      // Block handle for index
   padding:          char[40-p-q]; // zeroed bytes to make fixed length
                                   // (40==2*BlockHandle::kMaxEncodedLength)
   magic:            fixed64;      // 0x88e241b785f4cff7 (little-endian)
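The footer layout above can be read back with a few lines of code. The following Python sketch handles only the 48-byte layout shown here (40 bytes of handles plus padding, then the 8-byte magic); any other footer layout is out of scope. The helper names are hypothetical, and the varint decoding assumes the usual little-endian base-128 scheme.

```python
import struct
from typing import Dict, Tuple

FOOTER_SIZE = 48  # 40 bytes of handles and padding, then 8 bytes of magic
MAGIC = 0x88E241B785F4CFF7


def _put_varint64(value: int) -> bytes:
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def _get_varint64(buf: bytes, pos: int) -> Tuple[int, int]:
    result, shift = 0, 0
    while True:
        if pos >= len(buf) or shift > 63:
            raise ValueError("bad varint64 in footer")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def build_footer(metaindex: Tuple[int, int], index: Tuple[int, int]) -> bytes:
    """Serialize (offset, size) pairs into the footer layout shown above."""
    handles = b"".join(_put_varint64(v) for v in (*metaindex, *index))
    padding = b"\x00" * (40 - len(handles))  # zeroed bytes up to 40
    return handles + padding + struct.pack("<Q", MAGIC)


def parse_footer(file_tail: bytes) -> Dict[str, Tuple[int, int]]:
    """Parse the last FOOTER_SIZE bytes of a file into two block handles."""
    if len(file_tail) < FOOTER_SIZE:
        raise ValueError("file too short to contain a footer")
    footer = file_tail[-FOOTER_SIZE:]
    (magic,) = struct.unpack("<Q", footer[40:])
    if magic != MAGIC:
        raise ValueError("magic number mismatch")
    m_off, pos = _get_varint64(footer, 0)
    m_size, pos = _get_varint64(footer, pos)
    i_off, pos = _get_varint64(footer, pos)
    i_size, pos = _get_varint64(footer, pos)
    return {"metaindex": (m_off, m_size), "index": (i_off, i_size)}
```

A reader would typically seek to file_size - 48, read the footer, and then use the two handles to locate the metaindex and index blocks.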

Index Block

Index blocks are used to look up the data block whose key range includes a lookup key. An index block is a binary search data structure. A file may contain one index block or a list of partitioned index blocks (see Partitioned Index Filters). The index block format is documented here: Index Block Format.
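To make the binary search concrete, here is a small Python sketch of an index lookup. It assumes, for illustration only, that each index entry's key is a separator that is greater than or equal to every key in its data block and smaller than every key in the next block, and that the entry's value is the block's (offset, size) handle. The class and method names are hypothetical.

```python
import bisect
from typing import List, Optional, Tuple

Handle = Tuple[int, int]  # (offset, size) of a data block


class IndexBlock:
    """In-memory stand-in for a single, non-partitioned index block."""

    def __init__(self, entries: List[Tuple[bytes, Handle]]):
        # Entries must already be sorted by separator key.
        self._keys = [key for key, _ in entries]
        self._handles = [handle for _, handle in entries]

    def find_data_block(self, lookup_key: bytes) -> Optional[Handle]:
        """Return the handle of the only block that may hold lookup_key."""
        # First entry whose separator key is >= lookup_key.
        i = bisect.bisect_left(self._keys, lookup_key)
        if i == len(self._keys):
            return None  # lookup_key is past the last block
        return self._handles[i]
```

For example, with separators b"d", b"m" and b"z", a lookup for b"e" lands on the second block, since b"d" < b"e" <= b"m".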

Filter Meta Block

Note: format_version=5 (since RocksDB 6.6) uses a faster and more accurate Bloom filter implementation for full and partitioned filters.

Full filter

With this filter type, there is a single filter block for the entire SST file.

Partitioned Filter

The full filter is partitioned into multiple blocks. A top-level index block is added to map keys to the corresponding filter partitions. Read more here.

Block-based filter

Note: the following explains the block-based filter, which is deprecated and can be skipped.

If a "FilterPolicy" was specified when the database was opened, a filter block is stored in each table. The "metaindex" block contains an entry that maps from "filter.<N>" to the BlockHandle for the filter block, where "<N>" is the string returned by the filter policy's Name() method.

The filter block stores a sequence of filters, where filter i contains the output of FilterPolicy::CreateFilter() on all keys that are stored in a block whose file offset falls within the range

[ i*base ... (i+1)*base-1 ]

Currently, "base" is 2KB. So, for example, if blocks X and Y start in the range [ 0KB .. 2KB-1 ], all of the keys in X and Y will be converted to a filter by calling FilterPolicy::CreateFilter(), and the resulting filter will be stored as the first filter in the filter block.

The filter block is formatted as follows:

 [filter 0]
 [filter 1]
 [filter 2]
 ...
 [filter N-1]

 [offset of filter 0]                  : 4 bytes
 [offset of filter 1]                  : 4 bytes
 [offset of filter 2]                  : 4 bytes
 ...
 [offset of filter N-1]                : 4 bytes

 [offset of beginning of offset array] : 4 bytes
 lg(base)                              : 1 byte

The offset array at the end of the filter block allows efficient mapping from a data block offset to the corresponding filter.
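The mapping from a data block offset to its filter can be sketched as follows. This Python example assumes the 4-byte fields are little-endian unsigned integers (the layout above gives only their width), and both function names are hypothetical. The filter index is simply block_offset >> lg(base), and each filter ends where the next 4-byte entry in the trailer points.

```python
import struct
from typing import List, Optional


def build_filter_block(filters: List[bytes], base_lg: int = 11) -> bytes:
    """Lay out filters, their offsets, the array offset, and lg(base)."""
    body = bytearray()
    offsets = []
    for f in filters:
        offsets.append(len(body))
        body += f
    array_start = len(body)
    for off in offsets:
        body += struct.pack("<I", off)
    body += struct.pack("<I", array_start)  # offset of beginning of offset array
    body.append(base_lg)                    # lg(base); 11 means base == 2KB
    return bytes(body)


def filter_for_block(filter_block: bytes, block_offset: int) -> Optional[bytes]:
    """Return the filter covering the data block at block_offset, if any."""
    if len(filter_block) < 5:
        return None
    base_lg = filter_block[-1]
    trailer = len(filter_block) - 5  # position of the array-offset field
    (array_start,) = struct.unpack_from("<I", filter_block, trailer)
    if array_start > trailer:
        raise ValueError("corrupt filter block")
    num_filters = (trailer - array_start) // 4
    i = block_offset >> base_lg
    if i >= num_filters:
        return None
    (start,) = struct.unpack_from("<I", filter_block, array_start + 4 * i)
    # For the last filter this reads the array-offset field itself,
    # which is exactly where the last filter ends.
    (limit,) = struct.unpack_from("<I", filter_block, array_start + 4 * (i + 1))
    return filter_block[start:limit]
```

With base = 2KB, data blocks starting at offsets 0 and 2047 both map to filter 0, while a block starting at offset 2048 maps to filter 1.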

Properties Meta Block

This meta block contains a set of properties. For each entry, the key is the name of the property and the value is the property itself.

The stats block is formatted as follows:

 [prop1]    (Each property is a key/value pair)
 [prop2]
 ...
 [propN]

Properties are guaranteed to be sorted, with no duplicates.

By default, each table provides the following properties:

 data size               // the total size of all data blocks. 
 index size              // the size of the index block.
 filter size             // the size of the filter block.
 raw key size            // the size of all keys before any processing.
 raw value size          // the size of all values before any processing.
 number of entries
 number of data blocks

RocksDB also provides users with a callback for collecting the properties they are interested in about the table. Please refer to UserDefinedPropertiesCollector.

Compression Dictionary Meta Block

This metablock contains the dictionary used to prime the compression library before compressing/decompressing each block. Its purpose is to address a fundamental problem with dynamic dictionary compression algorithms on small data blocks: the dictionary is built during a single pass over the block, so small data blocks always have small and thus ineffective dictionaries.

Our solution is to initialize the compression library with a dictionary built from data sampled from previously seen blocks. This dictionary is then stored in a file-level meta-block for use during decompression. The upper bound on the size of this dictionary is configurable via CompressionOptions::max_dict_bytes. By default it is zero, i.e., the block is not generated or stored. Currently this feature is supported with kZlibCompression, kLZ4Compression, kLZ4HCCompression, and kZSTDNotFinalCompression.
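The effect of priming a compressor with a dictionary can be demonstrated with Python's standard zlib module, which accepts a preset dictionary through its zdict parameter. This is only an illustration of the general technique, not RocksDB's code path; the helper names and the sample data are made up for the example.

```python
import zlib


def compress_block(block: bytes, dictionary: bytes = b"") -> bytes:
    """Compress one block, optionally priming zlib with a preset dictionary."""
    if dictionary:
        compressor = zlib.compressobj(zdict=dictionary)
    else:
        compressor = zlib.compressobj()
    return compressor.compress(block) + compressor.flush()


def decompress_block(data: bytes, dictionary: bytes = b"") -> bytes:
    """Decompress one block; the same dictionary must be supplied again."""
    if dictionary:
        decompressor = zlib.decompressobj(zdict=dictionary)
    else:
        decompressor = zlib.decompressobj()
    return decompressor.decompress(data) + decompressor.flush()


# A dictionary that resembles previously seen data, and a small new block.
dictionary = b'{"user_id": 0, "status": "active", "region": "us-east-1"}'
block = b'{"user_id": 42, "status": "active", "region": "us-east-1"}'

plain = compress_block(block)
primed = compress_block(block, dictionary)
```

Because the small block shares most of its content with the dictionary, the primed output is shorter than the unprimed one. The decompressor needs the identical dictionary, which is why it has to be stored alongside the data.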

More specifically, the compression dictionary is built only during compaction to the bottommost level, where the data is largest and most stable. To avoid iterating over input data multiple times, the dictionary includes samples from the subcompaction's first output file only. The dictionary is then applied to, and stored in the meta-blocks of, all subsequent output files. Note that the dictionary is not applied to or stored in the first file, since its contents are not finalized until that file has been fully processed.

Currently the sampling is uniformly random and each sample is 64 bytes. We do not know the size of the output file in advance when selecting the sample offsets, so we assume it will reach the maximum size, which is usually true since it is the first file in the subcompaction. If the file turns out to be smaller, some sample intervals will refer to offsets beyond EOF, which just means the dictionary will be a bit smaller than CompressionOptions::max_dict_bytes.
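The sampling idea can be sketched in Python as below. The text does not say exactly how offsets are chosen, so this sketch assumes distinct, 64-byte-aligned intervals drawn uniformly from a file of the assumed maximum size; the function and parameter names are hypothetical.

```python
import random

SAMPLE_BYTES = 64  # size of each sample, as stated above


def sample_dictionary(file_data: bytes, max_file_size: int,
                      max_dict_bytes: int, rng: random.Random) -> bytes:
    """Build a dictionary from random 64-byte samples of file_data.

    Sample positions are chosen as if the file had max_file_size bytes,
    so samples that fall beyond the real end of file contribute nothing.
    """
    num_intervals = max_file_size // SAMPLE_BYTES
    num_samples = min(max_dict_bytes // SAMPLE_BYTES, num_intervals)
    chosen = sorted(rng.sample(range(num_intervals), num_samples))
    parts = []
    for interval in chosen:
        start = interval * SAMPLE_BYTES
        # Slicing past EOF yields an empty (or short) sample.
        parts.append(file_data[start:start + SAMPLE_BYTES])
    return b"".join(parts)
```

When the file reaches the assumed maximum size, the dictionary fills the whole max_dict_bytes budget; when the file is shorter, the dictionary shrinks accordingly.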

Range Deletion Meta Block

This metablock contains the range deletions in the file's key-range and seqnum-range. Range deletions cannot be inlined in the data blocks together with point data, since the ranges would then not be binary searchable.

The block format is the standard key-value format. A range deletion is encoded as follows:

  • User key: the range's begin key
  • Sequence number: the sequence number at which the range deletion was inserted into the DB
  • Value type: kTypeRangeDeletion
  • Value: the range's end key

Range deletions are assigned sequence numbers when inserted, using the same mechanism as non-range data types (puts, deletes, etc.). They also traverse the LSM using the same flush/compaction mechanism as point data. They can be obsoleted (i.e., dropped) only during compaction to the bottommost level.
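The encoding above can be modeled with a small Python sketch. It represents the value type by its name rather than by its numeric code, and it assumes two things that the text does not spell out: that the end key is exclusive, and that a range deletion only hides entries with a smaller sequence number. All class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import NamedTuple


class Entry(NamedTuple):
    """One key-value entry as it would appear in the range deletion block."""
    user_key: bytes
    sequence: int
    value_type: str
    value: bytes


@dataclass(frozen=True)
class RangeDeletion:
    begin_key: bytes
    end_key: bytes
    sequence: int

    def to_entry(self) -> Entry:
        # Begin key goes in the key, end key goes in the value.
        return Entry(self.begin_key, self.sequence,
                     "kTypeRangeDeletion", self.end_key)

    @classmethod
    def from_entry(cls, entry: Entry) -> "RangeDeletion":
        if entry.value_type != "kTypeRangeDeletion":
            raise ValueError("not a range deletion entry")
        return cls(entry.user_key, entry.value, entry.sequence)

    def covers(self, user_key: bytes, sequence: int) -> bool:
        # Assumptions: [begin_key, end_key) is half-open, and only
        # older entries (smaller sequence numbers) are hidden.
        in_range = self.begin_key <= user_key < self.end_key
        return in_range and sequence < self.sequence
```

For instance, a deletion of [b"b", b"f") inserted at sequence number 10 hides b"c" written at sequence 5, but not b"c" written at sequence 12 and not b"f" at any sequence number.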
