Lucene’s Practical Scoring Function

https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html

For multiterm queries, Lucene takes the Boolean model, TF/IDF, and the vector space model and combines them in a single efficient package that collects matching documents and scores them as it goes.

The following query DSL is an example of a multiterm query:

GET /my_index/doc/_search
{
  "query": {
    "match": {
      "text": "quick fox"
    }
  }
}

Internally, Elasticsearch rewrites it into the following form:

GET /my_index/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": { "text": "quick" }},
        {"term": { "text": "fox"   }}
      ]
    }
  }
}

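This bool query implements the Boolean model. In this example it matches only documents that contain the term quick, the term fox, or both.

As soon as a document matches a query, Lucene calculates its score for that query by combining the scores of each matching term. The formula used for scoring is called the practical scoring function. It looks intimidating, but you already know some of its components; the new elements are discussed below.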

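score(q,d)  =                  # relevance score of document d for query q
            queryNorm(q)       # query normalization factor
          · coord(q,d)         # coordination factor
          · ∑ (                # sum of the weights of each term t in query q
                tf(t in d)     # term frequency of t in document d
              · idf(t)²        # inverse document frequency of t, squared
              · t.getBoost()   # boost applied to the query
              · norm(t,d)      # field-length norm, combined with any index-time field-level boost
            ) (t in q)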
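
To make the structure of the formula concrete, here is a minimal Python sketch that combines factors that have already been computed. The function name practical_score, the dictionary keys, and the sample numbers are all hypothetical and chosen only for illustration; this is not Lucene's actual implementation.

```python
def practical_score(term_stats, query_norm, coord):
    """Combine precomputed factors following the formula above.

    term_stats holds one dict per query term that matched the document.
    The keys (tf, idf, boost, norm) are hypothetical names for this sketch.
    """
    weight_sum = sum(
        t["tf"] * t["idf"] ** 2 * t["boost"] * t["norm"]
        for t in term_stats
    )
    return query_norm * coord * weight_sum


# Made-up factor values for a two-term query; not taken from a real index.
terms = [
    {"tf": 1.0, "idf": 2.0, "boost": 1.0, "norm": 0.5},
    {"tf": 2.0, "idf": 1.0, "boost": 1.0, "norm": 0.5},
]
score = practical_score(terms, query_norm=0.5, coord=1.0)
print(score)
```

Because queryNorm and coord sit outside the sum, they scale the whole score, while tf, idf, boost, and norm are applied term by term.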

Query Normalization Factor

The query normalization factor (queryNorm) is an attempt to normalize a query so that the results from one query may be compared with the results of another.

Despite that intent, it does not work very well. Do not use relevance scores to compare the results of different queries.

This factor is calculated at the beginning of the query. The actual calculation depends on the queries involved, but a typical implementation is as follows:

queryNorm = 1 / √sumOfSquaredWeights 

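sumOfSquaredWeights is calculated by squaring the IDF of each term in the query and adding the squares together.

The same query normalization factor is applied to every document, and you have no way of changing it. For all intents and purposes, it can be ignored.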
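
As a rough illustration of the typical implementation, the Python sketch below takes sumOfSquaredWeights to be the sum of the squared IDF values of the query terms. The function name query_norm and the IDF values are assumptions made for this example, and per-term boosts are ignored.

```python
import math


def query_norm(idfs):
    """Return 1 / sqrt(sumOfSquaredWeights) for the given term IDF values."""
    sum_of_squared_weights = sum(idf ** 2 for idf in idfs)
    return 1.0 / math.sqrt(sum_of_squared_weights)


# Hypothetical IDF values for a two-term query: 3² + 4² = 25, so the norm is 1/5.
norm = query_norm([3.0, 4.0])
print(norm)
```

The result depends only on the query terms, never on a document, which is why it cannot change the ordering of results within a single query.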

Query Coordination

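The coordination factor (coord) is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in a document, the more likely it is that the document is a good match for the query.

Imagine that we have a query for quick brown fox, and that the weight of each term is 1.5. Without the coordination factor, the score would simply be the sum of the weights of the terms found in a document: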

  • Document with fox → score: 1.5
  • Document with quick fox → score: 3.0
  • Document with quick brown fox → score: 4.5

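The coordination factor multiplies the score by the number of matching terms in the document and divides it by the total number of terms in the query. With the coordination factor, the scores become: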

  • Document with fox → score: 1.5 * 1 / 3 = 0.5
  • Document with quick fox → score: 3.0 * 2 / 3 = 2.0
  • Document with quick brown fox → score: 4.5 * 3 / 3 = 4.5

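As a result, the document that contains all three terms is treated as much more relevant than the document that contains only two of them.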
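
The arithmetic above can be reproduced with a short Python sketch. The helper names coord and coordinated_score are hypothetical, and every term is assumed to carry the same weight of 1.5, as in the example.

```python
def coord(matching_terms, query_terms):
    """Fraction of the query's terms that appear in the document."""
    return matching_terms / query_terms


def coordinated_score(term_weight, matching_terms, query_terms):
    """Sum of the matching term weights, scaled by the coordination factor."""
    raw_score = term_weight * matching_terms
    return raw_score * coord(matching_terms, query_terms)


# Documents matching 1, 2, and 3 of the 3 terms in "quick brown fox".
for matching in (1, 2, 3):
    print(matching, round(coordinated_score(1.5, matching, 3), 2))
```

Without coordination the three scores grow linearly (1.5, 3.0, 4.5); with it, the gap between a partial match and a full match widens.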

The query quick brown fox is rewritten into a bool query like this:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "quick" }},
        { "term": { "text": "brown" }},
        { "term": { "text": "fox"   }}
      ]
    }
  }
}

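The bool query applies query coordination to all should clauses by default. You can disable coordination, but usually you should not: its effect is generally positive. When you use a bool query to wrap several high-level queries such as the match query, it also makes sense to leave coordination enabled. The more clauses that match, the greater the overlap between your search request and the documents that are returned.

However, in some advanced use cases, disabling coordination can make sense. Imagine that you are looking for the synonyms jump, leap, and hop. You do not care how many of these synonyms are present, as they all represent the same concept. In fact, only one of the synonyms is likely to be present. This would be a good case for disabling the coordination factor: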

GET /_search
{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "term": { "text": "jump" }},
        { "term": { "text": "hop"  }},
        { "term": { "text": "leap" }}
      ]
    }
  }
}

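When you use synonyms, this is what happens internally: the rewritten query disables coordination for the synonyms. Most use cases for disabling coordination are handled automatically; you don't need to worry about it.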

Index-Time Field-Level Boosting

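Query-Time Boosting covers how to boost a field at query time, making it more important than other fields. It is also possible to apply a boost to a field at index time. In practice, this boost is applied to every term in the field rather than to the field itself.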

To store this boost value in the index without using more space than necessary, this field-level index-time boost is combined with the field-length norm (see Field-length norm) and stored in the index as a single byte. This is the value returned by norm(t,d) in the preceding formula.

We strongly recommend against using field-level index-time boosts for a few reasons:

  • Combining the boost with the field-length norm and storing it in a single byte means that the field-length norm loses precision. The result is that Elasticsearch is unable to distinguish between a field containing three words and a field containing five words.
  • To change an index-time boost, you have to reindex all your documents. A query-time boost, on the other hand, can be changed with every query.
  • If a field with an index-time boost has multiple values, the boost is multiplied by itself for every value, dramatically increasing the weight for that field.

Query-time boosting is a much simpler, cleaner, more flexible option.

With query normalization, coordination, and index-time boosting out of the way, we can now move on to the most useful tool for influencing the relevance calculation: query-time boosting.
