文章

Elasticsearch:自定义得分

Elasticsearch的自定义打分策略。

  1. bool
  2. nested
  3. parent-child
  4. function score
    1. 和nested结合
      1. function_score + nested
      2. 叠加父文档得分
    2. 和parent-child结合
  5. 翻译成Java代码

bool

bool查询的语法就是query关键字下嵌套一个bool

bool query的逻辑比较简单:

  • must/should:其下的查询分纳入总分;
  • filter/must_not:其下的查询分重置为0;

所以bool的总分就是所有mustshould下的子查询的分值加和。

The bool query takes a more-matches-is-better approach, so the score from each matching must or should clause will be added together to provide the final _score for each document.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
POST _search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "user.id" : "kimchy" }
      },
      "filter": {
        "term" : { "tags" : "production" }
      },
      "must_not" : {
        "range" : {
          "age" : { "gte" : 10, "lte" : 20 }
        }
      },
      "should" : [
        { "term" : { "tags" : "env1" } },
        { "term" : { "tags" : "deployed" } }
      ],
      "minimum_should_match" : 1,
      "boost" : 1.0
    }
  }
}

nested

nested查询的语法就是query关键字下嵌套一个nested

先创建一个含有nested field的索引:一个manager下有多个employee,大家都只有name和age

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
PUT nested-properties
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "first": {
        "type": "text"
      },
      "second": {
        "type": "text"
      },
      "manager": {
        "properties": {
          "age":  { "type": "integer" },
          "name": { "type": "text"  }
        }
      },
      "employees": {
        "type": "nested",
        "properties": { 
          "age":  { "type": "integer" },
          "name": { "type": "text"  }
        }
      }
    }
  }
}

放三条数据:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
PUT nested-properties/_doc/1 
{
  "first": "hello",
  "second": "world",
  "manager": {
    "name": "Alice White",
    "age": 30
  },
  "employees": [
    {
      "name": "John Smith",
      "age": 34
    },
    {
      "name": "Peter Brown",
      "age": 26
    }
  ]
}

# object类型可以双重赋值?只是因为作为数组存储了
PUT nested-properties/_doc/2
{
  "first": "hello",
  "second": "kugou",
  "manager": {
    "name": "Henry White",
    "age": 30
  },
  "manager.name": "Bob White",
  "manager.age": 40,
  "employees": [
    {
      "name": "John Smith",
      "age": 34
    },
    {
      "name": "Peter Brown",
      "age": 26
    }
  ]
}

# update
POST nested-properties/_update/1
{
  "doc": {
    "employees": [
    {
      "name": "liuhaibo",
      "age": 12345
    },
    {
      "name": "pikachu",
      "age": 1234
    },
    {
      "name": "pikachu",
      "age": 4321
    },
    {
      "name": "pikachu",
      "age": 1111
    }
  ]
  }
}

PUT nested-properties/_doc/3
{
  "first": "hello",
  "second": "kugou",
  "manager": {
    "name": "Henry White",
    "age": 30
  },
  "manager.name": "Bob White",
  "manager.age": 40,
  "employees": [
    {
      "name": "Liuhaibo",
      "age": 12345
    },
    {
      "name": "pikachu",
      "age": 1234
    }
  ]
}

nested query的计分本身比较简单:查询到子文档后,使用score_mode把子文档的得分(默认是avg,求平均)汇总为父文档得分即可

parent-child

父子文档有三种查询方式:

根据子文档查询父文档(类似nested query)的语法就是query关键字下嵌套一个has_child

根据父文档查询子文档的语法就是query关键字下嵌套一个nested

查询一个父文档的所有子文档的语法就是query关键字下嵌套一个parent_id

has_childnested类似,所以都使用score_mode将子文档得分叠加为父文档得分,不同的是它的默认值为none,默认不拿子文档给父文档计分

has_parent使用score将父文档得分叠加为子文档得分,默认为false,不计分。

function score

function_score查询的语法就是query关键字下嵌套一个function_score

至此,所有上述的单独查询的计分规则都很简单。但是,如果我们想自定义查询的得分,多半需要使用function score查询来对得分进行多重修改,事情就变得复杂起来。

function score查询本身稍显复杂:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
GET /_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "boost": "5", 
      "functions": [
        {
          "filter": { "match": { "test": "bar" } },
          "random_score": {}, 
          "weight": 23
        },
        {
          "filter": { "match": { "test": "cat" } },
          "weight": 42
        }
      ],
      "max_boost": 42,
      "score_mode": "max",
      "boost_mode": "multiply",
      "min_score": 42
    }
  }
}

它主要由两部分组成,所以得分主要来源于两部分:

  • query的得分:function_score下可以指定一个query,代表本次打分涉及到的文档;
  • functions:函数得分。可以指定不止一个function,使用函数进行额外计分。
    • 每个函数也可以指定一个filter,代表打分的文档范围,可以和上边的query查询范围不同。因为是filter,所以不计分;
    • 具体打分函数有多种,可以实现相对复杂的公式;
    • 每个函数的得分可以通过weight独立加权;

最后要考虑的是怎么把这两部分得分合并起来:

  • score_mode:用于合并函数之间的得分,默认是multiply。函数总得分可以使用max_boost修正一下,作为得分上限;
  • boost_mode:用于合并query和函数总得分之间的得分,默认是multiply。最终总得分可以通过min_score限制一下,低于该得分的文档被抛弃;

听起来已经很复杂了,但是如果再叠加上nested查询或parent-child查询,理解起来会更麻烦,有一种无限套娃的感觉。

和nested结合

以上面的nested索引为例。当前共有三条文档:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "nested-properties",
        "_id": "2",
        "_score": 1,
        "_source": {
          "first": "hello",
          "second": "kugou",
          "manager": {
            "name": "Henry White",
            "age": 30
          },
          "manager.name": "Bob White",
          "manager.age": 40,
          "employees": [
            {
              "name": "John Smith",
              "age": 34
            },
            {
              "name": "Peter Brown",
              "age": 26
            }
          ]
        }
      },
      {
        "_index": "nested-properties",
        "_id": "1",
        "_score": 1,
        "_source": {
          "first": "hello",
          "second": "world",
          "manager": {
            "name": "Alice White",
            "age": 30
          },
          "employees": [
            {
              "name": "liuhaibo",
              "age": 12345
            },
            {
              "name": "pikachu",
              "age": 1234
            },
            {
              "name": "pikachu",
              "age": 4321
            },
            {
              "name": "pikachu",
              "age": 1111
            }
          ]
        }
      },
      {
        "_index": "nested-properties",
        "_id": "3",
        "_score": 1,
        "_source": {
          "first": "hello",
          "second": "kugou",
          "manager": {
            "name": "Henry White",
            "age": 30
          },
          "manager.name": "Bob White",
          "manager.age": 40,
          "employees": [
            {
              "name": "Liuhaibo",
              "age": 12345
            },
            {
              "name": "pikachu",
              "age": 1234
            }
          ]
        }
      }
    ]
  }
}

我们要实现如下自定义打分:

  • 找到manager名为alice的文档;
  • 选择它下面名为pikachu的employee;
  • 所有的employee以age作为打分依据,但是要经过一些函数变换;
  • manager也以age作为打分依据,也要经过一些函数变换;
  • 最终合并manager和employee的得分,作为文档总得分;

因为得分点过多,为了搞清楚每一个分值来自于哪里,可以使用explain API查看具体某一个文档的详细得分信息。

function_score + nested

先看怎么给nested文档打分。

文档1有三个pikachu嵌套子文档,age分别为1234、4321、1111。

查到一个二者结合进行打分的回答。但是如果这么组合,会发现得分很奇怪:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
GET nested-properties/_explain/1
{
  "query": {
    "function_score": {
      "query": {
        "nested": {
          "path": "employees",
          "query": {
            "bool": {
              "must": [
                {
                  "match": {
                    "employees.name": "pikachu"
                  }
                }
              ]
            }
          }
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "employee.age",
            "factor": 1.1,
            "modifier": "square",
            "missing": 1
          }
        }
      ]
    }
  }
}

解释详情如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
{
  "_index": "nested-properties",
  "_id": "1",
  "matched": true,
  "explanation": {
    "value": 0.9134444,
    "description": "sum of:",
    "details": [
      {
        "value": 0.91344434,
        "description": "function score, product of:",
        "details": [
          {
            "value": 0.7549127,
            "description": "Score based on 3 child docs in range from 3 to 6, best match:",
            "details": [
              {
                "value": 0.7549127,
                "description": "weight(employees.name:pikachu in 1) [PerFieldSimilarity], result of:",
                "details": [
                  {
                    "value": 0.7549127,
                    "description": "score(freq=1.0), computed as boost * idf * tf from:",
                    "details": [
                      {
                        "value": 2.2,
                        "description": "boost",
                        "details": []
                      },
                      {
                        "value": 0.6931472,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                          {
                            "value": 4,
                            "description": "n, number of documents containing term",
                            "details": []
                          },
                          {
                            "value": 8,
                            "description": "N, total number of documents with field",
                            "details": []
                          }
                        ]
                      },
                      {
                        "value": 0.49504948,
                        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                        "details": [
                          {
                            "value": 1,
                            "description": "freq, occurrences of term within document",
                            "details": []
                          },
                          {
                            "value": 1.2,
                            "description": "k1, term saturation parameter",
                            "details": []
                          },
                          {
                            "value": 0.75,
                            "description": "b, length normalization parameter",
                            "details": []
                          },
                          {
                            "value": 1,
                            "description": "dl, length of field",
                            "details": []
                          },
                          {
                            "value": 1.25,
                            "description": "avgdl, average length of field",
                            "details": []
                          }
                        ]
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value": 1.21,
            "description": "min of:",
            "details": [
              {
                "value": 1.21,
                "description": "field value function: square(doc['employee.age'].value?:1.0 * factor=1.1)",
                "details": []
              },
              {
                "value": 3.4028235e+38,
                "description": "maxBoost",
                "details": []
              }
            ]
          }
        ]
      },
      {
        "value": 0,
        "description": "match on required clause, product of:",
        "details": [
          {
            "value": 0,
            "description": "# clause",
            "details": []
          },
          {
            "value": 1,
            "description": "FieldExistsQuery [field=_primary_term]",
            "details": []
          }
        ]
      }
    ]
  }
}

function score得分为0.91344434,主要由两部分组成:

  1. 第一部分是bool里must下使用term query查询pikachu,所以term的得分(tfidf)保留了下来,一共0.7549127;
  2. 第二部分得分为1.21,打分公式为none(doc['employee.age'].value?:1.0 * factor=1.0)=1.21;

第二部分为什么是1.21?employee.age是有值的,但是似乎没有用到。很明显,1.21=(1*1.1)^2,所以没有用employee.age的实际值,而是用了默认值1(missing=1)。为什么?

看到这个回答才发现,上面的回答有误,查询的时候把嵌套层级搞错了。function score内部的query为nested查询,也就是说function score查询在nested查询外面,在nested查询外面是不可能访问到nested属性的。所以上面的查询实际是在查完nested子文档得分之后,再拿父文档的employee.age计算的分,但是父文档只有manager.age,哪来的employee.age?因此使用了missing value=1。

调一下两个查询的嵌套层级,整个function_score查询放到nested的query里,在nested query里面才能用里面的field打分:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
GET nested-properties/_explain/1
{
  "query": {
    "nested": {
      "path": "employees",
      "query": {
        "function_score": {
          "query": {
            "bool": {
              "filter": {
                "match": {
                  "employees.name": "pikachu"
                }
              }
            }
          },
          "functions": [
            {
              "field_value_factor": {
                "field": "employees.age",
                "factor": 1,
                "modifier": "none",
                "missing": 1
              }
            }
          ]
        }
      }
    }
  }
}

因为不关心tfidf的得分,只是想做个过滤,所以这次使用了filter。同时employee.age的打分规则也简化一下,不乘以1.1再平方了,就是用原始值。

结果答案更奇怪了,零分:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
{
  "_index": "nested-properties",
  "_id": "1",
  "matched": true,
  "explanation": {
    "value": 0,
    "description": "Score based on 3 child docs in range from 3 to 6, best match:",
    "details": [
      {
        "value": 0,
        "description": "sum of:",
        "details": [
          {
            "value": 0,
            "description": "function score, product of:",
            "details": [
              {
                "value": 0,
                "description": "ConstantScore(employees.name:pikachu)^0.0",
                "details": []
              },
              {
                "value": 1234,
                "description": "min of:",
                "details": [
                  {
                    "value": 1234,
                    "description": "field value function: none(doc['employees.age'].value?:1.0 * factor=1.0)",
                    "details": []
                  },
                  {
                    "value": 3.4028235e+38,
                    "description": "maxBoost",
                    "details": []
                  }
                ]
              }
            ]
          },
          {
            "value": 0,
            "description": "match on required clause, product of:",
            "details": [
              {
                "value": 0,
                "description": "# clause",
                "details": []
              },
              {
                "value": 1,
                "description": "_nested_path:employees",
                "details": []
              }
            ]
          }
        ]
      }
    ]
  }
}

显然,0分来自filter。仔细想想function score的计分方式:默认是query的得分乘以functions的得分。所以乘完得分也为0了。

因此应该给function score设置"boost_mode": "sum",求和:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
GET nested-properties/_explain/1
{
  "query": {
    "nested": {
      "path": "employees",
      "query": {
        "function_score": {
          "query": {
            "bool": {
              "filter": {
                "match": {
                  "employees.name": "pikachu"
                }
              }
            }
          },
          "functions": [
            {
              "field_value_factor": {
                "field": "employees.age",
                "factor": 1,
                "modifier": "none",
                "missing": 1
              }
            }
          ],
          "boost_mode": "sum"
        }
      }
    }
  }
}

依然不太对:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
{
  "_index": "nested-properties",
  "_id": "1",
  "matched": true,
  "explanation": {
    "value": 2222,
    "description": "Score based on 3 child docs in range from 3 to 6, best match:",
    "details": [
      {
        "value": 4321,
        "description": "sum of:",
        "details": [
          {
            "value": 4321,
            "description": "sum of",
            "details": [
              {
                "value": 0,
                "description": "ConstantScore(employees.name:pikachu)^0.0",
                "details": []
              },
              {
                "value": 4321,
                "description": "min of:",
                "details": [
                  {
                    "value": 4321,
                    "description": "field value function: none(doc['employees.age'].value?:1.0 * factor=1.0)",
                    "details": []
                  },
                  {
                    "value": 3.4028235e+38,
                    "description": "maxBoost",
                    "details": []
                  }
                ]
              }
            ]
          },
          {
            "value": 0,
            "description": "match on required clause, product of:",
            "details": [
              {
                "value": 0,
                "description": "# clause",
                "details": []
              },
              {
                "value": 1,
                "description": "_nested_path:employees",
                "details": []
              }
            ]
          }
        ]
      }
    ]
  }
}

"Score based on 3 child docs in range from 3 to 6, best match:"说明explain api只给了其中分最高的一个子文档的得分,4321,符合预期,但是最终得分为2222,而非4321+1234+1111。再仔细看,2222是三个数的均值……想起来了,nested查询默认给所有子文档求平均得分,作为父文档的总得分。

所以给所有子文档得分求和,修改nested查询的属性,"score_mode": "sum"

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
GET nested-properties/_explain/1
{
  "query": {
    "nested": {
      "path": "employees",
      "query": {
        "function_score": {
          "query": {
            "bool": {
              "filter": {
                "match": {
                  "employees.name": "pikachu"
                }
              }
            }
          },
          "functions": [
            {
              "field_value_factor": {
                "field": "employees.age",
                "factor": 1,
                "modifier": "none",
                "missing": 1
              }
            }
          ],
          "boost_mode": "sum"
        }
      },
      "score_mode": "sum"
    }
  }
}

这次得分对了,三个pikachu的age和叠加,值为6666。

叠加父文档得分

再叠加父属性(manager.age)怎么搞?

显然要再嵌套一层!为了防止绕晕,先把最外层的查询框架写出来

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
GET nested-properties/_search
{
  "query": {
    "function_score": {
      "query": {
        
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "manager.age"
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  }
}

这次把属性也先设置好:

  • score_mode:functions之间的分数叠加方式;
  • boost_mode:query score和function score之间的分数叠加方式;

然后把上面的子查询嵌入新的function_score的query处:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
GET nested-properties/_explain/1
{
  "query": {
    "function_score": {
      "query": {
        "nested": {
          "path": "employees",
          "query": {
            "function_score": {
              "query": {
                "bool": {
                  "filter": {
                    "match": {
                      "employees.name": "pikachu"
                    }
                  }
                }
              },
              "functions": [
                {
                  "field_value_factor": {
                    "field": "employees.age",
                    "factor": 1,
                    "modifier": "none",
                    "missing": 0
                  }
                }
              ],
              "boost_mode": "sum"
            }
          },
          "score_mode": "sum"
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "manager.age"
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  }
}

得分详情:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
{
  "_index": "nested-properties",
  "_id": "1",
  "matched": true,
  "explanation": {
    "value": 6696,
    "description": "sum of:",
    "details": [
      {
        "value": 6696,
        "description": "sum of",
        "details": [
          {
            "value": 6666,
            "description": "Score based on 3 child docs in range from 3 to 6, best match:",
            "details": [
              {
                "value": 4321,
                "description": "sum of:",
                "details": [
                  {
                    "value": 4321,
                    "description": "sum of",
                    "details": [
                      {
                        "value": 0,
                        "description": "ConstantScore(employees.name:pikachu)^0.0",
                        "details": []
                      },
                      {
                        "value": 4321,
                        "description": "min of:",
                        "details": [
                          {
                            "value": 4321,
                            "description": "field value function: none(doc['employees.age'].value?:0.0 * factor=1.0)",
                            "details": []
                          },
                          {
                            "value": 3.4028235e+38,
                            "description": "maxBoost",
                            "details": []
                          }
                        ]
                      }
                    ]
                  },
                  {
                    "value": 0,
                    "description": "match on required clause, product of:",
                    "details": [
                      {
                        "value": 0,
                        "description": "# clause",
                        "details": []
                      },
                      {
                        "value": 1,
                        "description": "_nested_path:employees",
                        "details": []
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value": 30,
            "description": "min of:",
            "details": [
              {
                "value": 30,
                "description": "field value function: none(doc['manager.age'].value * factor=1.0)",
                "details": []
              },
              {
                "value": 3.4028235e+38,
                "description": "maxBoost",
                "details": []
              }
            ]
          }
        ]
      },
      {
        "value": 0,
        "description": "match on required clause, product of:",
        "details": [
          {
            "value": 0,
            "description": "# clause",
            "details": []
          },
          {
            "value": 1,
            "description": "FieldExistsQuery [field=_primary_term]",
            "details": []
          }
        ]
      }
    ]
  }
}

只看最外层的function score:query是nested query,得分6666,functions得分30,经过sum最终为6696。

重要的是分层!

那么怎么实现一开始说的过滤manager.name=alice呢?把上面的查询放到bool查询里即可:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
GET nested-properties/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "manager.name": "alice"
          }
        }
      ],
      "must": [
        {
          "function_score": {
            "query": {
              "nested": {
                "path": "employees",
                "query": {
                  "function_score": {
                    "query": {
                      "bool": {
                        "filter": {
                          "match": {
                            "employees.name": "pikachu"
                          }
                        }
                      }
                    },
                    "functions": [
                      {
                        "field_value_factor": {
                          "field": "employees.age",
                          "factor": 1,
                          "modifier": "none",
                          "missing": 0
                        }
                      }
                    ],
                    "boost_mode": "sum"
                  }
                },
                "score_mode": "sum"
              }
            },
            "functions": [
              {
                "field_value_factor": {
                  "field": "manager.age"
                }
              }
            ],
            "score_mode": "sum",
            "boost_mode": "sum"
          }
        }
      ]
    }
  }
}

相当于有两个独立的子查询,一个在过滤alice,另一个在计算每一条父文档的得分。

但是如果思路不同,还能写出其他的查询方式。比如先过滤alice,计算每一个父文档的所有子文档得分,最终再和父文档的manager.age做加权:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
GET nested-properties/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": [
            {
              "match": {
                "manager.name": "alice"
              }
            }
          ],
          "must": [
            {
              "nested": {
                "path": "employees",
                "query": {
                  "function_score": {
                    "query": {
                      "terms": {
                        "employees.name": [
                          "pikachu"
                        ]
                      }
                    },
                    "functions": [
                      {
                        "field_value_factor": {
                          "field": "employees.age",
                          "factor": 1,
                          "modifier": "none",
                          "missing": 0
                        }
                      }
                    ],
                    "score_mode": "sum",
                    "boost_mode": "multiply"
                  }
                },
                "score_mode": "sum"
              }
            }
          ]
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "manager.age"
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  }
}

得分也是6696。仔细看看,它和第一个查询的区别在于,把内层function_score查询的functions函数拎出来放到了外层的function_score里,对于得分来说是等价的,但是对应了两种不同的思路,当然两种思路的效果也是等价的。

上面的查询还修改了另外一个地方:内层function_score的query从bool+filter改成了terms,所以query的得分从0分变成了1分,相应的,function_scoreboost_mode从sum改为了multiply,得分不变。

terms查询term查询不同:前者得分为1,后者得分为tfidf。

terms查询的得分为1是因为它是一个布尔查询,只要有一个匹配的项,就会将该文档返回为匹配文档,因此每个匹配项的得分都被认为是相同的,并且都为1。因为terms查询的得分仅仅反映了文档是否匹配查询条件,而不反映匹配程度的高低。因此,在terms查询中,文档的相关性得分与查询无关。

所以不想用tfidf的话,用terms不错。

和parent-child结合

道理和nested是一样的,无论叠加什么函数,重要的是分清把谁放在谁里面:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
GET /manager_employee/_search
{
  "query":{
    "bool":{
      "filter":[
        {
          "term":{
            "manager.platform":"platform1"
          }
        }
      ],
      "must":[
        {
          "function_score":{
            "query":{
              "has_child":{
                "type":"employee",
                "score_mode":"sum",
                "query":{
                  "bool":{
                    "should":[
                      {
                        "function_score":{
                          "query":{
                            "match_phrase":{
                              "employee.title.":"halo infinite"
                            }
                          },
                          "boost":10
                        }
                      },
                      {
                        "function_score":{
                          "query":{
                            "match_phrase":{
                              "employee.description.":"halo infinite"
                            }
                          },
                          "boost":1
                        }
                      },
                      {
                        "function_score":{
                          "query":{
                            "match_phrase":{
                              "employee.tags.eng":"halo infinite"
                            }
                          },
                          "boost":1
                        }
                      }
                    ],
                    "filter":[
                      {
                        "range":{
                          "employee.timestamp":{
                            "gte":1640966400000
                          }
                        }
                      }
                    ],
                    "minimum_should_match":1
                  }
                }
              }
            },
            "script_score":{
              "script":"_score * Math.log(3 + doc['manager.age'].value)"
            },
            "score_mode":"multiply"
          }
        }
      ]
    }
  }
}

综上,使用function_score写查询的关键是:头脑清晰。先外后内,一层层写出每一个查询,再拼接起来。尤其是注意区分nested/has_child本身的分值叠加方式和function score查询的分值叠加方式,不要混为一谈。

翻译成Java代码

怎么把上面的查询用Java client写出来……查询本身就很复杂了,这要是再翻译成代码……

开摆,交给ChatGPT!

我有一个这样的索引:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
PUT nested-properties
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "first": {
        "type": "text"
      },
      "second": {
        "type": "text"
      },
      "manager": {
        "properties": {
          "age":  { "type": "integer" },
          "name": { "type": "text"  }
        }
      },
      "employees": {
        "type": "nested",
        "properties": { 
          "age":  { "type": "integer" },
          "name": { "type": "text"  }
        }
      }
    }
  }
}

把下面的elasticsearch查询使用Java api写,代码怎么写

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
GET nested-properties/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": [
            {
              "match": {
                "manager.name": "alice"
              }
            }
          ],
          "must": [
            {
              "nested": {
                "path": "employees",
                "query": {
                  "function_score": {
                    "query": {
                      "terms": {
                        "employees.name": [
                          "pikachu"
                        ]
                      }
                    },
                    "functions": [
                      {
                        "field_value_factor": {
                          "field": "employees.age",
                          "factor": 1,
                          "modifier": "none",
                          "missing": 0
                        }
                      },
                      {
                        "field_value_factor": {
                          "field": "employees.age",
                          "factor": 1,
                          "modifier": "none",
                          "missing": 0
                        }
                      }
                    ],
                    "score_mode": "sum",
                    "boost_mode": "multiply"
                  }
                },
                "score_mode": "sum"
              }
            }
          ]
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "manager.age"
          }
        },
        {
          "field_value_factor": {
            "field": "manager.age"
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  }
}

以下是使用Java API编写的等效Elasticsearch查询代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.MatchQueryBuilder;
import org.elasticsearch.index.query.NestedQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.TermsQueryBuilder;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.FieldSortBuilder;
import org.elasticsearch.search.sort.SortOrder;

// 创建RestHighLevelClient实例
RestHighLevelClient client = new RestHighLevelClient();

// 构建查询请求
SearchRequest searchRequest = new SearchRequest("nested-properties");
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();

// 构建function_score查询
BoolQueryBuilder queryBuilder = QueryBuilders.boolQuery();

// 添加manager.name的filter
MatchQueryBuilder managerNameQuery = QueryBuilders.matchQuery("manager.name", "alice");
queryBuilder.filter(managerNameQuery);

// 构建nested查询
NestedQueryBuilder nestedQuery = QueryBuilders.nestedQuery("employees",
        QueryBuilders.functionScoreQuery(
                QueryBuilders.termsQuery("employees.name", "pikachu"),
                new FunctionScoreQueryBuilder.FilterFunctionBuilder[]{
                        new FunctionScoreQueryBuilder.FilterFunctionBuilder(
                                QueryBuilders.fieldValueFactorFunction("employees.age")
                                        .factor(1)
                                        .modifier(FunctionScoreQueryBuilder.Modifier.NONE)
                                        .missing(0)
                        ),
                        new FunctionScoreQueryBuilder.FilterFunctionBuilder(
                                QueryBuilders.fieldValueFactorFunction("employees.age")
                                        .factor(1)
                                        .modifier(FunctionScoreQueryBuilder.Modifier.NONE)
                                        .missing(0)
                        )
                }
        )
                .scoreMode(FunctionScoreQuery.ScoreMode.SUM)
                .boostMode(CombineFunction.MULTIPLY)
);
queryBuilder.must(nestedQuery);

// 添加manager.age的function
queryBuilder.addFunction(
        new FieldValueFactorFunctionBuilder("manager.age")
                .factor(1)
                .modifier(FieldValueFactorFunction.Modifier.NONE)
);
queryBuilder.addFunction(
        new FieldValueFactorFunctionBuilder("manager.age")
                .factor(1)
                .modifier(FieldValueFactorFunction.Modifier.NONE)
);

// 设置查询条件和排序
sourceBuilder.query(queryBuilder);
sourceBuilder.sort(new FieldSortBuilder("_score").order(SortOrder.DESC));

// 将查询源添加到查询请求
searchRequest.source(sourceBuilder);

// 执行查询请求
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);

// 处理查询结果
// ...

// 关闭RestHighLevelClient连接
client.close();

请注意,您需要根据您的Elasticsearch集群配置和Java客户端版本进行适当的设置和错误处理。此代码仅作为示例,并可能需要根据您的实际环境进行调整。

不错,大致是能用的。技术确实改变了世界,至少改变了我写代码的方式。

本文由作者按照 CC BY 4.0 进行授权