Translation: Journal MANIFEST

Original URL: https://github.com/facebook/rocksdb/wiki/MANIFEST

(Translated with Youdao)

Overview

RocksDB is file system and storage medium agnostic. File system operations are not atomic and are susceptible to inconsistencies in the event of a system failure. Even with journaling turned on, file systems do not guarantee consistency after an unclean restart, and POSIX file systems do not support atomic batching of operations either. Hence, it is not possible to rely on metadata embedded in RocksDB datastore files to reconstruct the last consistent state of RocksDB on restart.

RocksDB has a built-in mechanism to overcome these limitations of POSIX file systems: it keeps a transactional log of RocksDB state changes, stored as Version Edit records in the manifest log files. MANIFEST is used to restore RocksDB to the latest known consistent state on restart.

Terminology

  • MANIFEST refers to the system that keeps track of RocksDB state changes in a transactional log

  • Manifest log refers to an individual log file that contains a RocksDB state snapshot and edits

  • CURRENT refers to the pointer to the latest manifest log

How does it work?

MANIFEST is a transactional log of RocksDB state changes. It consists of the manifest log files and a pointer to the latest manifest file (CURRENT). Manifest logs are rolling log files named MANIFEST-(seq number), where the sequence number is always increasing. CURRENT is a special file that points to the latest manifest log file.

On system (re)start, the latest manifest log contains the consistent state of RocksDB. Any subsequent change to RocksDB state is logged to that manifest log file. When a manifest log file exceeds a certain size, a new manifest log file is created with a snapshot of the RocksDB state. The pointer to the latest manifest file is then updated and the file system is synced. Once the CURRENT file has been updated successfully, the redundant manifest logs are purged.

MANIFEST = { CURRENT, MANIFEST-<seq-no>* }
CURRENT = File pointer to the latest manifest log
MANIFEST-<seq-no> = Contains snapshot of RocksDB state and subsequent modifications
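
The rolling procedure described above can be sketched roughly as follows. This is a minimal illustration, not RocksDB's implementation: the helper names (read_current, set_current, roll_manifest), the six-digit file name padding, the ".dbtmp" temporary suffix, and the choice to update CURRENT by writing a temporary file and renaming it are all assumptions made for the example.

```python
import os
import tempfile

def read_current(db_dir):
    """Return the manifest file name recorded in CURRENT."""
    with open(os.path.join(db_dir, "CURRENT"), "r") as f:
        return f.read().strip()

def set_current(db_dir, manifest_name):
    """Point CURRENT at manifest_name by writing a temp file and renaming it."""
    fd, tmp_path = tempfile.mkstemp(dir=db_dir, suffix=".dbtmp")
    with os.fdopen(fd, "w") as f:
        f.write(manifest_name + "\n")
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step, so readers see the old or new pointer.
    os.replace(tmp_path, os.path.join(db_dir, "CURRENT"))

def roll_manifest(db_dir, next_seq, snapshot):
    """Create MANIFEST-<next_seq> with a state snapshot, switch CURRENT, purge old logs."""
    new_name = "MANIFEST-%06d" % next_seq
    with open(os.path.join(db_dir, new_name), "wb") as f:
        f.write(snapshot)
        f.flush()
        os.fsync(f.fileno())
    set_current(db_dir, new_name)
    # Redundant manifest logs are removed only after CURRENT has been updated.
    for name in os.listdir(db_dir):
        if name.startswith("MANIFEST-") and name != new_name:
            os.remove(os.path.join(db_dir, name))
    return new_name
```

Purging only after the pointer switch means a crash at any earlier step still leaves CURRENT pointing at a complete manifest log.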

Version Edit

The state of RocksDB at a given point in time is referred to as a Version (aka snapshot). Any modification to a Version is considered a Version Edit. A Version (or RocksDB state snapshot) is constructed by applying a sequence of version-edits in order. Essentially, a manifest log file is a sequence of version-edits.

version-edit      = Any RocksDB state change
version           = { version-edit* }
manifest-log-file = { version, version-edit* }
                  = { version-edit* }
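
As a rough illustration of "version = { version-edit* }", the sketch below replays a list of edits to build a state. The dictionary layout (log_number, new_files, deleted_files) and the function names are hypothetical simplifications chosen for this example; a real Version tracks far more.

```python
def apply_edit(version, edit):
    """Return a new version with one version edit applied (inputs are not mutated)."""
    new_version = {
        "log_number": version["log_number"],
        "files": set(version["files"]),
    }
    if "log_number" in edit:
        new_version["log_number"] = edit["log_number"]
    for f in edit.get("deleted_files", ()):
        new_version["files"].discard(f)
    for f in edit.get("new_files", ()):
        new_version["files"].add(f)
    return new_version

def replay(edits):
    """Build a version by applying every edit of a manifest log in order."""
    version = {"log_number": 0, "files": set()}
    for edit in edits:
        version = apply_edit(version, edit)
    return version

# Each file is written here as (level, file number).
manifest_log = [
    {"new_files": [(0, 5)]},
    {"new_files": [(0, 7)], "log_number": 8},
    {"deleted_files": [(0, 5), (0, 7)], "new_files": [(1, 9)]},
]
state = replay(manifest_log)
```

After the replay, state holds the single file (1, 9) and log number 8, which is what a snapshot at the head of a new manifest log would need to record.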

Version Edit Layout

A manifest log is a sequence of Version Edit records. The type of each Version Edit record is identified by its edit identification number.

The following data types are used for encoding and decoding.

Data Types

Simple data types

VarX   - Variable-length encoding of intX
FixedX - Fixed-length encoding of intX

Complex data types

String - Length prefixed string data
+-----------+--------------------+
| size (n)  | content of string  |
+-----------+--------------------+
|<- Var32 ->|<-- n            -->|
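
A minimal sketch of these data types, assuming the LevelDB-style varint (seven payload bits per byte, low-order group first, high bit set on every byte except the last) and little-endian fixed-width integers. The function names are made up for this example.

```python
import struct

def encode_varint(value):
    """Encode a non-negative integer as a varint (works for Var32 and Var64)."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit marks "more bytes follow"
        value >>= 7
    out.append(value)
    return bytes(out)

def decode_varint(buf, pos=0):
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7

def encode_fixed32(value):
    """Encode an unsigned 32-bit integer in exactly four bytes."""
    return struct.pack("<I", value)

def encode_string(data):
    """Length-prefixed string: Var32 size followed by the raw bytes."""
    return encode_varint(len(data)) + data

def decode_string(buf, pos=0):
    """Decode a length-prefixed string; return (bytes, next position)."""
    n, pos = decode_varint(buf, pos)
    return buf[pos:pos + n], pos + n
```

Small values such as record IDs fit in a single byte under this scheme, which is why the record layouts favor Var32 and Var64 over fixed-width fields.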

Version Edit Record Format

Version Edit records have the following format. The decoder identifies the record type from the record identification number.

+-------------+------ ......... ----------+
| Record ID   | Variable size record data |
+-------------+------ .......... ---------+
<-- Var32 --->|<-- varies by type       -->

Version Edit Record Types and Layout

There are a variety of edit records, corresponding to different state changes of RocksDB.

Comparator edit record:

Captures the comparator name

+-------------+----------------+
| kComparator | data           |
+-------------+----------------+
<-- Var32 --->|<-- String   -->|

Log number edit record:

Latest WAL log file number

+-------------+----------------+
| kLogNumber  | log number     |
+-------------+----------------+
<-- Var32 --->|<-- Var64    -->|

Previous File Number edit record:

Previous manifest file number

+------------------+----------------+
| kPrevFileNumber  | log number     |
+------------------+----------------+
<-- Var32      --->|<-- Var64    -->|

Next File Number edit record:

Next manifest file number

+------------------+----------------+
| kNextFileNumber  | file number    |
+------------------+----------------+
<-- Var32      --->|<-- Var64    -->|

Last Sequence Number edit record:

Last sequence number of RocksDB

+------------------+------------------+
| kLastSequence    | sequence number  |
+------------------+------------------+
<-- Var32      --->|<-- Var64      -->|
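
The log number, previous file number, next file number, and last sequence number records all share one shape: a Var32 record ID followed by a Var64 value. The sketch below encodes and decodes that shape. The numeric tag values are assumptions for illustration only (the text above does not state them), and the helper names are hypothetical.

```python
def encode_varint(value):
    """Varint encoding: 7 payload bits per byte, high bit set on all but the last."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def decode_varint(buf, pos=0):
    """Return (value, next position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7

# Assumed tag values, for illustration only.
K_LOG_NUMBER = 2
K_NEXT_FILE_NUMBER = 3
K_LAST_SEQUENCE = 4

def encode_number_record(tag, number):
    """Record ID (Var32) followed by a single Var64 payload."""
    return encode_varint(tag) + encode_varint(number)

def decode_number_record(buf, pos=0):
    """Return (tag, number, next position)."""
    tag, pos = decode_varint(buf, pos)
    number, pos = decode_varint(buf, pos)
    return tag, number, pos

record = encode_number_record(K_LAST_SEQUENCE, 1_000_000)
decoded = decode_number_record(record)
```

Because the payload length is implied by the varint itself, these records need no explicit size field.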

Max Column Family edit record:

Adjust the maximum number of column families allowed.

+---------------------+--------------------+
| kMaxColumnFamily    | max column family  |
+---------------------+--------------------+
<-- Var32         --->|<-- Var32        -->|

Deleted File edit record:

Mark a file as deleted from database.

+-----------------+-------------+--------------+
| kDeletedFile    | level       | file number  |
+-----------------+-------------+--------------+
<-- Var32     --->|<-- Var32 -->|<-- Var64  -->|

New File edit record:

Mark a file as newly added to the database and provide RocksDB meta information.

  • File edit record with compaction information
+--------------+-------------+--------------+------------+----------------+--------------+----------------+----------------+
| kNewFile4    | level       | file number  | file size  | smallest_key   | largest_key  | smallest_seqno | largest_seq_no |
+--------------+-------------+--------------+------------+----------------+--------------+----------------+----------------+
|<-- var32  -->|<-- var32 -->|<-- var64  -->|<-  var64 ->|<-- String   -->|<-- String -->|<-- var64    -->|<-- var64    -->|

+--------------+------------------+---------+------+----------------+--------------------+---------+------------+
|  CustomTag1  | Field 1 size n1  | field1  | ...  |  CustomTag(m)  | Field m size n(m)  | field(m)| kTerminate |
+--------------+------------------+---------+------+----------------+--------------------+---------+------------+
<-- var32   -->|<-- var32      -->|<- n1  ->|      |<-- var32    -->|<--    var32     -->|<- n(m)->|<- var32 -->|

Several optional customized fields can be written there.
Each field carries a special bit indicating whether it can be safely ignored. This exists for compatibility: an older RocksDB release may encounter a field it cannot identify, and by checking the bit it knows whether to stop opening the DB or to ignore the field.
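
The compatibility check can be sketched as a loop over tag / size / value triples that ends at the terminator tag. The numeric values below (terminator tag 1, kNeedCompaction 2, kPathId 65, and bit 6 of the tag meaning "not safe to ignore") are assumptions made for this illustration, as are the function names; single-byte tags and sizes are assumed to keep the varint handling short.

```python
# Assumed values, for illustration only.
K_TERMINATE = 1
K_NEED_COMPACTION = 2
K_PATH_ID = 65
NOT_SAFE_TO_IGNORE_MASK = 1 << 6

def parse_custom_fields(buf, pos, known_tags):
    """Parse custom fields until the terminator; return (fields, next position).

    Tags and sizes are assumed to be below 128, so each varint is one byte.
    """
    fields = {}
    while True:
        tag = buf[pos]
        pos += 1
        if tag == K_TERMINATE:
            return fields, pos
        size = buf[pos]
        pos += 1
        value = buf[pos:pos + size]
        pos += size
        if tag in known_tags:
            fields[tag] = value
        elif tag & NOT_SAFE_TO_IGNORE_MASK:
            # An unknown field that must not be skipped: refuse to open the DB.
            raise ValueError("unknown custom field %d cannot be ignored" % tag)
        # Otherwise the field is unknown but ignorable, so it is skipped.

# A known field (tag 2), an unknown ignorable field (tag 10), then the terminator.
buf = bytes([K_NEED_COMPACTION, 1, 1, 10, 2]) + b"xx" + bytes([K_TERMINATE])
fields, end = parse_custom_fields(buf, 0, {K_NEED_COMPACTION, K_PATH_ID})
```

The explicit size in front of every field is what lets an older reader step over a field it does not understand.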

Several optional customized fields are supported:

  • kNeedCompaction: Whether the file should be compacted to the next level.
  • kMinLogNumberToKeepHack: WAL file number that is still needed for recovery after this entry.
  • kPathId: The Path ID in which the file lives. This can't be ignored by an old release.
  • Backward-compatible file edit record
+--------------+-------------+--------------+------------+----------------+--------------+----------------+----------------+
| kNewFile2    | level       | file number  | file size  | smallest_key   | largest_key  | smallest_seqno | largest_seq_no |
+--------------+-------------+--------------+------------+----------------+--------------+----------------+----------------+
<-- var32   -->|<-- var32 -->|<-- var64  -->|<-  var64 ->|<-- String   -->|<-- String -->|<-- var64    -->|<-- var64    -->|
  • File edit record with path information
+--------------+-------------+--------------+-------------+-------------+----------------+--------------+
| kNewFile3    | level       | file number  | Path ID     | file size   | smallest_key   | largest_key  |
+--------------+-------------+--------------+-------------+-------------+----------------+--------------+
|<-- var32  -->|<-- var32 -->|<-- var64  -->|<-- var32 -->|<-- var64 -->|<-- String   -->|<-- String -->|
+----------------+----------------+
| smallest_seqno | largest_seq_no |
+----------------+----------------+
<-- var64     -->|<-- var64    -->|

Column family status edit record:

Note the status of column family feature (enabled/disabled)

+------------------+----------------+
| kColumnFamily    | 0/1            |
+------------------+----------------+
<-- Var32      --->|<-- Var32    -->|

Column family add edit record:

Add a column family

+---------------------+----------------+
| kColumnFamilyAdd    | cf name        |
+---------------------+----------------+
<-- Var32         --->|<-- String   -->|

Column family drop edit record:

Drop a column family

+---------------------+
| kColumnFamilyDrop   |
+---------------------+
<-- Var32         --->|

Record as part of an atomic group (since RocksDB 5.16):

There are cases in which an 'all-or-nothing', multi-column-family version change is desirable. For example, atomic flush ensures that either all or none of the column families get flushed successfully, and external SST ingestion into multiple column families guarantees that either all or none of the column families ingest their SSTs successfully. Since writing multiple version edits is not atomic, extra measures are needed to achieve atomicity (not necessarily instantaneity from the user's perspective). Therefore a record field, kInAtomicGroup, indicates that a record is part of a group of Version Edits that follow the 'all-or-none' property. The format is as follows.

+-----------------+--------------------------------------------+
| kInAtomicGroup  | #remaining Version Edits in the same group |
+-----------------+--------------------------------------------+
|<--- Var32 ----->|<----------------- Var32 ------------------>|

During recovery, RocksDB buffers the Version Edits of an atomic group without applying them until the last Version Edit of the group has been decoded successfully from the MANIFEST file. RocksDB then applies all the Version Edits in the atomic group. RocksDB never applies a partial atomic group.
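
The buffering rule can be sketched as follows. The edit representation (a dictionary with a name and an optional remaining count taken from kInAtomicGroup) and the function name recover are hypothetical; the sketch also assumes that ordinary edits never appear in the middle of a group.

```python
def recover(edits):
    """Return the names of the edits that get applied, in order.

    An edit inside an atomic group carries "remaining": the number of
    Version Edits still to come in the same group (0 for the last one).
    """
    applied = []
    pending_group = []
    for edit in edits:
        remaining = edit.get("remaining")
        if remaining is None:
            # Ordinary edit: applied as soon as it is decoded.
            applied.append(edit["name"])
            continue
        # Group member: buffered until the whole group has been seen.
        pending_group.append(edit["name"])
        if remaining == 0:
            applied.extend(pending_group)
            pending_group = []
    # Anything left in pending_group belongs to an incomplete group and is dropped.
    return applied

log = [
    {"name": "a"},
    {"name": "b", "remaining": 1},
    {"name": "c", "remaining": 0},
    {"name": "d", "remaining": 2},  # the log ends before this group completes
    {"name": "e", "remaining": 1},
]
result = recover(log)
```

Here the group b, c is applied as a unit, while d and e are discarded because the third member of their group was never written.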

Version Edit ignorable record types

A special bit is reserved in the record type. If the bit is set, the record can be safely ignored. Safely ignorable records share a standard general format:

+---------+----------------+----------------+
|   kTag  | field length n |  fields ...    |
+---------+----------------+----------------+
<- Var32->|<--  Var32   -->|<---    n   --->|

This was introduced in RocksDB 6.0, when no customized ignorable record had been created yet.
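
A decoder can use this general format to step over record types it does not recognize. In the sketch below, the position of the reserved bit (bit 13) and the tag values 2 and 4 used for the known records are assumptions for illustration, and the function names are hypothetical.

```python
SAFE_IGNORE_MASK = 1 << 13  # assumed position of the reserved bit

def encode_varint(value):
    """Varint encoding: 7 payload bits per byte, high bit set on all but the last."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def decode_varint(buf, pos=0):
    """Return (value, next position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7

def decode_records(buf, handlers):
    """Decode known records, skip ignorable unknown ones, reject the rest."""
    pos = 0
    seen = []
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        if tag in handlers:
            value, pos = handlers[tag](buf, pos)
            seen.append((tag, value))
        elif tag & SAFE_IGNORE_MASK:
            # Ignorable record: read its length n and skip n bytes.
            n, pos = decode_varint(buf, pos)
            pos += n
        else:
            raise ValueError("unknown record type %d" % tag)
    return seen

buf = (
    encode_varint(2) + encode_varint(8)                               # known record
    + encode_varint(SAFE_IGNORE_MASK + 99) + encode_varint(3) + b"abc"  # unknown, ignorable
    + encode_varint(4) + encode_varint(100)                           # known record
)
records = decode_records(buf, {2: decode_varint, 4: decode_varint})
```

The length prefix is what makes skipping possible: without it, a reader that does not know the record type could not tell where the next record starts.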

The following types of Version Edits fall into the ignorable category.

DB ID edit record: introduced in RocksDB 6.5. If options.write_dbid_to_manifest is true, RocksDB writes the DB ID edit record to the MANIFEST file in addition to storing the DB ID in the IDENTITY file.

+-----------+------------+
|   kDbId   |    db id   |
+-----------+------------+
|<- Var32 ->|<- String ->|