URL: https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format

Rocksdb BlockBasedTable Format

This page is forked from LevelDB's document on table format, and reflects changes we have made during the development of RocksDB.

BlockBasedTable is the default SST table format in RocksDB.

File format

<beginning_of_file>
[data block 1]
[data block 2]
...
[data block N]
[meta block 1: filter block]                  (see section: "filter" Meta Block)
[meta block 2: index block]
[meta block 3: compression dictionary block]  (see section: "compression dictionary" Meta Block)
[meta block 4: range deletion block]          (see section: "range deletion" Meta Block)
[meta block 5: stats block]                   (see section: "properties" Meta Block)
...
[meta block K: future extended block]  (we may add more meta blocks in the future)
[metaindex block]
[Footer]                               (fixed size; starts at file_size - sizeof(Footer))
<end_of_file>

The file contains internal pointers, called BlockHandles. Each BlockHandle contains the following information:

offset:         varint64
size:           varint64

See this document for an explanation of varint64 format.
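The BlockHandle encoding above can be sketched in a few lines. This is a minimal illustration, assuming the usual little-endian base-128 varint (7 payload bits per byte, high bit set on every byte except the last). The function names are hypothetical and are not RocksDB's API.

```python
def encode_varint64(value):
    """Encode an integer in [0, 2**64) as a base-128 varint (assumed format)."""
    if not 0 <= value < 1 << 64:
        raise ValueError("value out of range for varint64")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)  # final byte has the continuation bit clear
    return bytes(out)


def decode_varint64(buf, pos=0):
    """Return (value, next_pos) for the varint that starts at buf[pos]."""
    result = 0
    shift = 0
    while shift <= 63:  # at most 10 bytes
        if pos >= len(buf):
            raise ValueError("truncated varint64")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7
    raise ValueError("varint64 is longer than 10 bytes")


def encode_block_handle(offset, size):
    """A BlockHandle is the offset followed by the size, both as varint64."""
    return encode_varint64(offset) + encode_varint64(size)


def decode_block_handle(buf, pos=0):
    """Return (offset, size, next_pos)."""
    offset, pos = decode_varint64(buf, pos)
    size, pos = decode_varint64(buf, pos)
    return offset, size, pos
```

Under this encoding a varint64 takes at most 10 bytes, so an encoded BlockHandle takes at most 20 bytes.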

(1) The sequence of key/value pairs in the file is stored in sorted order and partitioned into a sequence of data blocks. These blocks come one after another at the beginning of the file. Each data block is formatted according to the code in block_builder.cc (see code comments in the file), and then optionally compressed.

(2) After the data blocks, we store a number of meta blocks. The supported meta block types are described below. More meta block types may be added in the future. Each meta block is again formatted using block_builder.cc and then optionally compressed.

(3) A metaindex block contains one entry for every meta block, where the key is the name of the meta block and the value is a BlockHandle pointing to that meta block.

(4) At the very end of the file is a fixed-length footer that contains the BlockHandles of the metaindex and index blocks as well as a magic number.

   metaindex_handle: char[p];      // Block handle for metaindex
   index_handle:     char[q];      // Block handle for index
   padding:          char[40-p-q]; // zeroed bytes to make fixed length
                                   // (40==2*BlockHandle::kMaxEncodedLength)
   magic:            fixed64;      // 0x88e241b785f4cff7 (little-endian)
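As an illustration, the footer layout listed above can be written and read back as follows. This is a sketch that follows only the fields shown in this section. Files written with newer format versions may carry additional footer fields, so it should not be taken as a parser for arbitrary RocksDB files. The helper names are hypothetical, and the varint encoding is the same assumed base-128 format as for BlockHandles.

```python
import struct

MAGIC = 0x88E241B785F4CFF7
HANDLES_LEN = 40               # 2 * BlockHandle::kMaxEncodedLength
FOOTER_LEN = HANDLES_LEN + 8   # handles and padding, then the fixed64 magic


def _put_varint64(value):
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def _get_varint64(buf, pos):
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def encode_footer(metaindex_handle, index_handle):
    """Each handle is an (offset, size) pair."""
    handles = b"".join(
        _put_varint64(v) for v in (*metaindex_handle, *index_handle)
    )
    padding = bytes(HANDLES_LEN - len(handles))  # zeroed bytes
    return handles + padding + struct.pack("<Q", MAGIC)


def decode_footer(footer):
    """Return (metaindex_handle, index_handle) or raise ValueError."""
    if len(footer) != FOOTER_LEN:
        raise ValueError("unexpected footer length")
    (magic,) = struct.unpack_from("<Q", footer, HANDLES_LEN)
    if magic != MAGIC:
        raise ValueError("bad magic number")
    pos, fields = 0, []
    for _ in range(4):
        value, pos = _get_varint64(footer, pos)
        fields.append(value)
    return (fields[0], fields[1]), (fields[2], fields[3])
```

A reader would seek to file_size - FOOTER_LEN, read FOOTER_LEN bytes, and pass them to decode_footer to find the metaindex and index blocks.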

Index Block

An index block is used to find the data block whose key range includes a lookup key. It is a binary-searchable data structure. A file may contain one index block, or a list of partitioned index blocks (see Partitioned Index Filters). Index block format is documented here: Index Block Format.
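The lookup can be sketched as a binary search over the index entries. The sketch assumes, as in LevelDB's table format, that each index entry's key is greater than or equal to the last key of its data block and smaller than the first key of the next block, and that its value is the data block's BlockHandle. The in-memory list representation and the function name are hypothetical.

```python
from bisect import bisect_left


def find_data_block(index_entries, lookup_key):
    """Return the (offset, size) handle of the block that may hold lookup_key.

    index_entries is a list of (separator_key, handle) pairs sorted by
    separator_key. Returns None if lookup_key is past the last block.
    """
    keys = [key for key, _ in index_entries]
    i = bisect_left(keys, lookup_key)  # first separator >= lookup_key
    if i == len(index_entries):
        return None
    return index_entries[i][1]


index = [(b"d", (0, 100)), (b"m", (105, 80)), (b"z", (190, 60))]
handle = find_data_block(index, b"e")  # falls in the second block
```

Only the block named by the returned handle needs to be read and searched, which is why the index has to be binary searchable.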

Filter Meta Block

Note: format_version=5 (since RocksDB 6.6) uses a faster and more accurate Bloom filter implementation for full and partitioned filters.

Full filter

With a full filter, there is a single filter block for the entire SST file.

Partitioned Filter

The full filter is partitioned into multiple blocks. A top-level index block is added to map keys to corresponding filter partitions. Read more here.

Block-based filter

Note: the following describes the block-based filter, which is deprecated and can be ignored.

If a "FilterPolicy" was specified when the database was opened, a filter block is stored in each table. The "metaindex" block contains an entry that maps from "filter.<N>" to the BlockHandle for the filter block, where "<N>" is the string returned by the filter policy's Name() method.

The filter block stores a sequence of filters, where filter i contains the output of FilterPolicy::CreateFilter() on all keys that are stored in a block whose file offset falls within the range

[ i*base ... (i+1)*base-1 ]

Currently, "base" is 2KB. So, for example, if blocks X and Y start in the range [ 0KB .. 2KB-1 ], all of the keys in X and Y will be converted to a filter by calling FilterPolicy::CreateFilter(), and the resulting filter will be stored as the first filter in the filter block.

The filter block is formatted as follows:

 [filter 0]
 [filter 1]
 [filter 2]
 ...
 [filter N-1]

 [offset of filter 0]                  : 4 bytes
 [offset of filter 1]                  : 4 bytes
 [offset of filter 2]                  : 4 bytes
 ...
 [offset of filter N-1]                : 4 bytes

 [offset of beginning of offset array] : 4 bytes
 lg(base)                              : 1 byte

The offset array at the end of the filter block allows efficient mapping from a data block offset to the corresponding filter.
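The mapping from a data block offset to its filter can be sketched as below. This is an illustration of the layout listed above, not RocksDB code. It assumes the 4-byte offsets are little-endian, and it treats each filter as an opaque byte string standing in for the output of FilterPolicy::CreateFilter(). The function names are hypothetical.

```python
import struct

BASE_LG = 11  # lg(base); base is 2KB


def build_filter_block(filters):
    """Lay out filters, their offsets, the array offset, and lg(base)."""
    body = bytearray()
    offsets = []
    for f in filters:
        offsets.append(len(body))
        body += f
    array_offset = len(body)
    for off in offsets:
        body += struct.pack("<I", off)       # [offset of filter i]
    body += struct.pack("<I", array_offset)  # [offset of beginning of offset array]
    body.append(BASE_LG)                     # lg(base)
    return bytes(body)


def filter_for_block(filter_block, block_offset):
    """Return the filter covering the data block that starts at block_offset."""
    base_lg = filter_block[-1]
    (array_offset,) = struct.unpack_from("<I", filter_block, len(filter_block) - 5)
    num_filters = (len(filter_block) - 5 - array_offset) // 4
    i = block_offset >> base_lg  # filter i covers [i*base, (i+1)*base-1]
    if i >= num_filters:
        return None
    # The entry after offset i is either offset i+1 or, for the last filter,
    # the array offset itself, which marks where the filter data ends.
    start, limit = struct.unpack_from("<II", filter_block, array_offset + 4 * i)
    return filter_block[start:limit]
```

The lookup costs one shift and two 4-byte reads, regardless of how many filters the block holds.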

Properties Meta Block

This meta block contains a set of properties. For each property, the key is the property's name and the value is the property's value.

The stats block is formatted as follows:

 [prop1]    (Each property is a key/value pair)
 [prop2]
 ...
 [propN]

Properties are guaranteed to be sorted, with no duplicates.

By default, each table provides the following properties.

 data size               // the total size of all data blocks. 
 index size              // the size of the index block.
 filter size             // the size of the filter block.
 raw key size            // the size of all keys before any processing.
 raw value size          // the size of all values before any processing.
 number of entries
 number of data blocks

RocksDB also lets users register a callback to collect the properties they are interested in for this table. Please refer to UserDefinedPropertiesCollector.

Compression Dictionary Meta Block

This metablock contains the dictionary used to prime the compression library before compressing/decompressing each block. Its purpose is to address a fundamental problem with dynamic dictionary compression algorithms on small data blocks: the dictionary is built during a single pass over the block, so small data blocks always have small and thus ineffective dictionaries.

Our solution is to initialize the compression library with a dictionary built from data sampled from previously seen blocks. This dictionary is then stored in a file-level meta-block for use during decompression. The upper bound on the size of this dictionary is configurable via CompressionOptions::max_dict_bytes. By default it is zero, i.e., the block is not generated or stored. Currently this feature is supported with kZlibCompression, kLZ4Compression, kLZ4HCCompression, and kZSTDNotFinalCompression.
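The effect of priming a compressor with a dictionary can be demonstrated with Python's standard zlib module, which accepts a preset dictionary through the zdict argument. This is only an illustration of the idea, not RocksDB's implementation. The sample records and function names are made up for the example.

```python
import zlib


def compress_block(block, dictionary=b""):
    """Compress one block, optionally priming zlib with a preset dictionary."""
    if dictionary:
        compressor = zlib.compressobj(zdict=dictionary)
    else:
        compressor = zlib.compressobj()
    return compressor.compress(block) + compressor.flush()


def decompress_block(data, dictionary=b""):
    """Decompress a block; the same dictionary must be supplied again."""
    if dictionary:
        decompressor = zlib.decompressobj(zdict=dictionary)
    else:
        decompressor = zlib.decompressobj()
    return decompressor.decompress(data) + decompressor.flush()


# A dictionary built from previously seen data (hypothetical records).
dictionary = b"".join(
    b"user:%04d|status=active|region=us-east-1;" % i for i in range(50)
)
# A small block: too short to build up useful history on its own.
small_block = b"user:9731|status=active|region=us-east-1;"

plain = compress_block(small_block)
primed = compress_block(small_block, dictionary)
print(len(small_block), len(plain), len(primed))
```

Because decompression needs exactly the same dictionary, it has to be stored somewhere the reader can find it, which is the role of the file-level meta-block.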

More specifically, the compression dictionary is built only during compaction to the bottommost level, where the data is largest and most stable. To avoid iterating over input data multiple times, the dictionary includes samples from the subcompaction's first output file only. Then, the dictionary is applied to and stored in meta-blocks of all subsequent output files. Note the dictionary is not applied to or stored in the first file since its contents are not finalized until that file has been fully processed.

Currently the sampling is uniformly random and each sample is 64 bytes. We do not know in advance the size of the output file when selecting the sample offsets, so we assume it will reach the maximum size, which is usually true since it is the first file in the subcompaction. If the file turns out to be smaller, some sample intervals will refer to offsets beyond EOF, which just means the dictionary will be a bit smaller than CompressionOptions::max_dict_bytes.
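The sampling scheme can be sketched as follows. The 64-byte sample size and the handling of offsets past EOF follow the description above. Everything else is an assumption made for the example: samples are aligned to 64-byte slots, slots are drawn without replacement, and the function names are hypothetical.

```python
import random

SAMPLE_BYTES = 64


def pick_sample_offsets(max_dict_bytes, assumed_file_size, rng):
    """Choose sample offsets before the real file size is known."""
    num_samples = max_dict_bytes // SAMPLE_BYTES
    num_slots = assumed_file_size // SAMPLE_BYTES
    slots = rng.sample(range(num_slots), min(num_samples, num_slots))
    return sorted(slot * SAMPLE_BYTES for slot in slots)


def build_dictionary(file_bytes, offsets):
    """Concatenate the samples; offsets beyond EOF contribute nothing."""
    return b"".join(
        file_bytes[off:off + SAMPLE_BYTES]
        for off in offsets
        if off < len(file_bytes)
    )


rng = random.Random(0)
offsets = pick_sample_offsets(1024, 64 * 1024, rng)  # assume a 64KB file
full_dict = build_dictionary(bytes(64 * 1024), offsets)
short_dict = build_dictionary(bytes(16 * 1024), offsets)  # file ended early
```

When the file reaches the assumed size, the dictionary fills the whole budget. When it ends early, the samples past EOF are dropped and the dictionary comes out smaller.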

Range Deletion Meta Block

This metablock contains the range deletions in the file's key-range and seqnum-range. Range deletions cannot be inlined in the data blocks together with point data since the ranges would then not be binary searchable.

The block format is the standard key-value format. A range deletion is encoded as follows:

User key: the range's begin key
Sequence number: the sequence number at which the range deletion was inserted to the DB
Value type: kTypeRangeDeletion
Value: the range's end key

Range deletions are assigned sequence numbers when inserted, using the same mechanism as non-range data types (puts, deletes, etc.). They also traverse the LSM using the same flush/compaction mechanism as point data. They can be obsoleted (i.e., dropped) only during compaction to the bottommost level.
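The encoding of a range deletion can be sketched as a key/value pair. Several details here are assumptions not stated in this section: the key is taken to be the user key followed by an 8-byte little-endian trailer packing the sequence number and value type as (seqnum << 8) | type, kTypeRangeDeletion is taken to have the numeric value 0x0F, and the end key is treated as exclusive. The class and function names are hypothetical.

```python
import struct
from typing import NamedTuple

K_TYPE_RANGE_DELETION = 0x0F  # assumed numeric value of kTypeRangeDeletion


class RangeTombstone(NamedTuple):
    begin_key: bytes
    end_key: bytes
    seqnum: int


def encode_entry(tombstone):
    """Return the (key, value) pair stored in the range deletion block."""
    trailer = struct.pack(
        "<Q", (tombstone.seqnum << 8) | K_TYPE_RANGE_DELETION
    )
    return tombstone.begin_key + trailer, tombstone.end_key


def covers(tombstone, user_key, key_seqnum):
    """True if the tombstone deletes this version of user_key.

    Assumes the end key is exclusive and that only versions written
    before the tombstone are deleted.
    """
    in_range = tombstone.begin_key <= user_key < tombstone.end_key
    return in_range and key_seqnum < tombstone.seqnum


tombstone = RangeTombstone(begin_key=b"b", end_key=b"f", seqnum=100)
key, value = encode_entry(tombstone)
```

Because a tombstone carries an ordinary sequence number, it can be ordered against point entries in the same way those are ordered against each other.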
