Apache Lucene - Index File Formats V7.3.0

Introduction

This document defines the index file formats used in this version of Lucene. If you are using a different version of Lucene, please consult the copy of docs/ that was distributed with the version you are using.

This document attempts to provide a high-level definition of the Apache Lucene file formats.

Definitions

The fundamental concepts in Lucene are index, document, field and term.

An index contains a sequence of documents.

  • A document is a sequence of fields.

  • A field is a named sequence of terms.

  • A term is a sequence of bytes.

The same sequence of bytes in two different fields is considered a different term. Thus terms are represented as a pair: the string naming the field, and the bytes within the field.

Inverted Indexing

The index stores statistics about terms in order to make term-based search more efficient. Lucene's index falls into the family of indexes known as an inverted index. This is because it can list, for a term, the documents that contain it. This is the inverse of the natural relationship, in which documents list terms.
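The term-to-documents mapping described above can be sketched in plain Java. This is a toy illustration of the inverted-index idea, not Lucene's actual data structures; the class and method names are assumptions made for the example.

```java
import java.util.*;

// Toy inverted index: maps each term to the documents that contain it,
// the inverse of the natural document -> terms relationship.
public class InvertedIndexSketch {
    // term -> sorted set of document numbers containing that term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    public void addDocument(int docNumber, String text) {
        // Naive whitespace tokenization, purely for illustration.
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docNumber);
        }
    }

    // For a term, list the documents that contain it.
    public List<Integer> documentsContaining(String term) {
        return new ArrayList<>(postings.getOrDefault(term, new TreeSet<>()));
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(0, "lucene stores an inverted index");
        index.addDocument(1, "documents list terms");
        index.addDocument(2, "an inverted index lists documents per term");
        System.out.println(index.documentsContaining("inverted")); // [0, 2]
    }
}
```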

Types of Fields

In Lucene, fields may be stored, in which case their text is stored in the index literally, in a non-inverted manner. Fields that are inverted are called indexed. A field may be both stored and indexed.

The text of a field may be tokenized into terms to be indexed, or the text of a field may be used literally as a term to be indexed. Most fields are tokenized, but sometimes it is useful for certain identifier fields to be indexed literally.
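The two indexing modes can be contrasted with a small sketch. This is not Lucene's analysis chain; the whitespace tokenizer and the method names are assumptions for illustration only.

```java
import java.util.*;

// Toy contrast between tokenized and literal ("not analyzed") fields.
public class FieldIndexingSketch {
    // Tokenized: the field text is split into individual terms.
    public static List<String> tokenized(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    // Literal: the whole field text is indexed as one exact term,
    // useful for identifier-like fields.
    public static List<String> literal(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        System.out.println(tokenized("Apache Lucene index formats"));
        // [apache, lucene, index, formats]
        System.out.println(literal("ORDER-2024-00017"));
        // [ORDER-2024-00017]
    }
}
```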

See the Field java docs for more information on Fields.

Segments

Lucene indexes may be composed of multiple sub-indexes, or segments. Each segment is a fully independent index, which could be searched separately. Indexes evolve by:

  1. Creating new segments for newly added documents.
  2. Merging existing segments.

Searches may involve multiple segments and/or multiple indexes, each index potentially composed of a set of segments.

Document Numbers

Internally, Lucene refers to documents by an integer document number. The first document added to an index is numbered zero, and each subsequent document added gets a number one greater than the previous.

Note that a document's number may change, so caution should be taken when storing these numbers outside of Lucene. In particular, numbers may change in the following situations:

  • The numbers stored in each segment are unique only within the segment, and must be converted before they can be used in a larger context. The standard technique is to allocate each segment a range of values, based on the range of numbers used in that segment. To convert a document number from a segment to an external value, the segment's base document number is added. To convert an external value back to a segment-specific value, the segment is identified by the range that the external value is in, and the segment's base value is subtracted. For example two five document segments might be combined, so that the first segment has a base value of zero, and the second of five. Document three from the second segment would have an external value of eight.

  • When documents are deleted, gaps are created in the numbering. These are eventually removed as the index evolves through merging. Deleted documents are dropped when segments are merged. A freshly-merged segment thus has no gaps in its numbering.
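The base-value conversion described above can be sketched in a few lines, using the example from the text: two five-document segments with base values 0 and 5. The class and method names are illustrative, not Lucene's API.

```java
// Sketch of converting between segment-local and index-wide document numbers.
public class DocNumberSketch {
    // segmentBases[i] is the base document number of segment i:
    // two merged five-document segments -> bases 0 and 5.
    static final int[] segmentBases = {0, 5};

    // segment-local number -> external (index-wide) number: add the base.
    static int toExternal(int segment, int localDocNumber) {
        return segmentBases[segment] + localDocNumber;
    }

    // external number -> segment-local number: find the segment whose
    // range contains the value, then subtract that segment's base.
    static int toLocal(int external) {
        int segment = 0;
        while (segment + 1 < segmentBases.length
                && external >= segmentBases[segment + 1]) {
            segment++;
        }
        return external - segmentBases[segment];
    }

    public static void main(String[] args) {
        System.out.println(toExternal(1, 3)); // document 3 of segment 1 -> 8
        System.out.println(toLocal(8));       // back to 3 within segment 1
    }
}
```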

Index Structure Overview

Each segment index maintains the following:

  • Segment info. This contains metadata about a segment, such as the number of documents, what files it uses.

  • Field names. This contains the set of field names used in the index.

  • Stored Field values. This contains, for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, url, or an identifier to access a database. The set of stored fields are what is returned for each hit when searching. This is keyed by document number.

  • Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.

  • Term Frequency data. For each term in the dictionary, the number of documents that contain that term, and the frequency of the term in each of those documents, unless frequencies are omitted (IndexOptions.DOCS_ONLY).

  • Term Proximity data. For each term in the dictionary, the positions that the term occurs in each document. Note that this will not exist if all fields in all documents omit position data.

  • Normalization factors. For each field in each document, a value is stored that is multiplied into the score for hits on that field.

  • Term Vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors.

  • Per-document values. Like stored values, these are also keyed by document number, but are generally intended to be loaded into main memory for fast access. Whereas stored values are generally intended for summary results from searches, per-document values are useful for things like scoring factors.

  • Live documents. An optional file indicating which documents are live.

  • Point values. Optional pair of files, recording dimensionally indexed fields, to enable fast numeric range filtering and large numeric values like BigInteger and BigDecimal (1D) and geographic shape intersection (2D, 3D).
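As a toy illustration of the term vector mentioned above, the vector for one field of one document pairs each term's text with its frequency in that field. This is not Lucene's on-disk format; the class and method names are assumptions.

```java
import java.util.*;

// Sketch of a term vector: term text -> frequency within one field.
public class TermVectorSketch {
    public static Map<String, Integer> termVector(String fieldText) {
        Map<String, Integer> vector = new TreeMap<>(); // sorted for stable output
        for (String term : fieldText.toLowerCase().split("\\s+")) {
            vector.merge(term, 1, Integer::sum); // count each occurrence
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(termVector("to be or not to be"));
        // {be=2, not=1, or=1, to=2}
    }
}
```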

Details on each of these are provided in their linked pages.

File Naming

All files belonging to a segment have the same name with varying extensions. The extensions correspond to the different file formats described below. When using the Compound File format (default for small segments) these files (except for the Segment info file, the Lock file, and Deleted documents file) are collapsed into a single .cfs file (see below for details).

Typically, all segments in an index are stored in a single directory, although this is not required.

File names are never re-used. That is, when any file is saved to the Directory it is given a never before used filename. This is achieved using a simple generations approach. For example, the first segments file is segments_1, then segments_2, etc. The generation is a sequential long integer represented in alpha-numeric (base 36) form.
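The generation-to-filename scheme above can be sketched with Java's built-in base-36 formatting. The helper name is an assumption for illustration; note that after segments_z (generation 35) the next name is segments_10 (36 in base 36).

```java
// Sketch of the generation naming scheme: a sequential long integer
// rendered in alpha-numeric (base 36) form and appended to "segments_".
public class SegmentNameSketch {
    static String segmentsFileName(long generation) {
        // Character.MAX_RADIX is 36, i.e. digits 0-9 then letters a-z.
        return "segments_" + Long.toString(generation, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        System.out.println(segmentsFileName(1));  // segments_1
        System.out.println(segmentsFileName(35)); // segments_z
        System.out.println(segmentsFileName(36)); // segments_10
    }
}
```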

Summary of File Extensions

The following table summarizes the names and extensions of the files in Lucene:

Name Extension Brief Description
Segments File segments_N Stores information about a commit point.
Lock File write.lock The Write lock prevents multiple IndexWriters from writing to the same file.
Segment Info .si Stores metadata about a segment.
Compound File .cfs, .cfe An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles.
Fields .fnm Stores information about the fields.
Field Index .fdx Contains pointers to field data.
Field Data .fdt The stored fields for documents.
Term Dictionary .tim The term dictionary, stores term info.
Term Index .tip The index into the Term Dictionary.
Frequencies .doc Contains the list of docs which contain each term along with frequency.
Positions .pos Stores position information about where a term occurs in the index.
Payloads .pay Stores additional per-position metadata information such as character offsets and user payloads.
Norms .nvd, .nvm Encodes length and boost factors for docs and fields.
Per-Document Values .dvd, .dvm Encodes additional scoring factors or other per-document information.
Term Vector Index .tvx Stores offset into the document data file.
Term Vector Data .tvd Contains term vector data.
Live Documents .liv Info about what documents are live.
Point Values .dii, .dim Holds indexed points, if any.

Lock File

The write lock, which is stored in the index directory by default, is named "write.lock". If the lock directory is different from the index directory then the write lock will be named "XXXX-write.lock" where XXXX is a unique prefix derived from the full path to the index directory. When this file is present, a writer is currently modifying the index (adding or removing documents). This lock file ensures that only one writer is modifying the index at a time.

History

Compatibility notes are provided in this document, describing how file formats have changed from prior versions:

  • In version 2.1, the file format was changed to allow lock-less commits (ie, no more commit lock). The change is fully backwards compatible: you can open a pre-2.1 index for searching or adding/deleting of docs. When the new segments file is saved (committed), it will be written in the new file format (meaning no specific "upgrade" process is needed). But note that once a commit has occurred, pre-2.1 Lucene will not be able to read the index.

  • In version 2.3, the file format was changed to allow segments to share a single set of doc store (vectors & stored fields) files. This allows for faster indexing in certain cases. The change is fully backwards compatible (in the same way as the lock-less commits change in 2.1).

  • In version 2.4, Strings are now written as a true UTF-8 byte sequence, not Java's modified UTF-8. See LUCENE-510 for details.

  • In version 2.9, an optional opaque Map<String,String> CommitUserData may be passed to IndexWriter's commit methods (and later retrieved), which is recorded in the segments_N file. See LUCENE-1382 for details. Also, diagnostics were added to each segment written, recording details about why it was written (due to flush, merge; which OS/JRE was used; etc.). See LUCENE-1654 for details.

  • In version 3.0, compressed fields are no longer written to the index (they can still be read, but on merge the new segment will write them uncompressed). See LUCENE-1960 for details.

  • In version 3.1, segments record the code version that created them. See LUCENE-2720 for details. Additionally, segments track explicitly whether or not they have term vectors. See LUCENE-2811 for details.

  • In version 3.2, numeric fields are written natively to the stored fields file; previously they were stored in text format only.

  • In version 3.4, fields can omit position data while still indexing term frequencies.

  • In version 4.0, the format of the inverted index became extensible via the Codec API. Fast per-document storage (DocValues) was introduced. Normalization factors need no longer be a single byte; they can be any NumericDocValues. Terms need not be unicode strings; they can be any byte sequence. Term offsets can optionally be indexed into the postings lists. Payloads can be stored in the term vectors.

  • In version 4.1, the format of the postings list changed to use either FOR compression or variable-byte encoding, depending upon the frequency of the term. Terms appearing only once were changed to inline directly into the term dictionary. Stored fields are compressed by default.

  • In version 4.2, term vectors are compressed by default. DocValues has a new multi-valued type (SortedSet) that can be used for faceting/grouping/joining on multi-valued fields.

  • In version 4.5, DocValues were extended to explicitly represent missing values.

  • In version 4.6, FieldInfos were extended to support per-field DocValues generation, to allow updating NumericDocValues fields.

  • In version 4.8, checksum footers were added to the end of each index file for improved data integrity. Specifically, the last 8 bytes of every index file contain the zlib-crc32 checksum of the file.

  • In version 4.9, DocValues has a new multi-valued numeric type (SortedNumeric) that is suitable for faceting/sorting/analytics.

  • In version 5.4, DocValues have been improved to store more information on disk: addresses for binary fields and ord indexes for multi-valued fields.

  • In version 6.0, Points were added, for multi-dimensional range/distance search.

  • In version 6.2, a new Segment info format was added that reads/writes the index sort, to support index sorting.

  • In version 7.0, DocValues have been improved to better support sparse doc values thanks to an iterator API.

Limitations

Lucene uses a Java int to refer to document numbers, and the index file format uses an Int32 on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. Eventually these should be replaced with either UInt64 values, or better yet, VInt values which have no limit.
