文章

Elasticsearch:search

Elasticsearch,主要目的就是search。最简单的寻找数据的方式是遍历,缺点是耗时;es搜索非常迅速,因为在存储上做了很多支持,通俗地说就是以空间换时间。除了快,es还支持种类繁多的各式搜索,以实现不同的目的。

  1. 搜索
    1. _id搜索
    2. 按field搜索:search api
    3. term - 完整匹配
    4. terms - mysql in
    5. match - 多词查询
    6. match_phrase - 考虑token位置
      1. match + match_phrase - 混合起来搜索
      2. shingle - 介于match和match_phrase之间
    7. bool - 组合搜索
      1. filter缓存
    8. multi_match - 一种特殊的bool
    9. exists - 值不为空
    10. not exists
    11. range - 范围匹配
    12. fuzzy - 自动纠错
    13. 返回随机结果
    14. 其他不太会用到的搜索
      1. query_string - 搜索所有field
      2. match_phrase_prefix - 短语前缀搜索
      3. prefix - 前缀搜索
      4. index_prefix - 前缀索引搜索
      5. wildcard - 通配符搜索
    15. 搜索选项
      1. _source
  2. 倒排索引 - 空间换时间
  3. analyzer
  4. highlight

搜索

es有两种搜索方式:

  1. 常用的_search api:https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
  2. 直接按照_id获取文档:GET <index>/_doc/<_id>

search api是最常用的搜索,功能很丰富,但它是 从磁盘搜索,所以新写入的数据必须在refresh后才能被搜索到,仅仅在buffer pool里是不能被搜索的

关于refresh和buffer pool:Elasticsearch:分片读写

  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/making-text-searchable.html

    早期的全文检索会为整个文档集合建立一个很大的倒排索引并将其 写入到磁盘。 一旦新的索引就绪,旧的就会被其替换,这样最近的变化便可以被检索到。

按照_id搜索,是可以直接搜索到的,和refresh interval无关。

_id搜索

之所以和refresh无关,因为_id不需要走Lucene索引:_id is realtime, not near-realtime

  • https://www.reddit.com/r/elasticsearch/comments/oqkyom/comment/h6cfnlr/?utm_source=share&utm_medium=web2x&context=3

realtime get: regardless of the “refresh rate”

  • https://github.com/elastic/elasticsearch/issues/1060
  • https://elasticsearch.cn/question/4278

query是走倒排索引查询的所以会出现refresh未执行搜索不到的情况,但是如果你用get,也就是通过id查询的话,他会先从translog拿一下,写translog是写操作的第一步,就不受refresh影响了。2.2的代码我看了是这么做的,但是5.x不太清楚了

举个例子:

1
2
3
4
5
6
7
8
9
10
11
12
PUT /my_logs
{
  "settings": {
    "refresh_interval": "30s" 
  }
}

PUT my_logs/_doc/1
{
  "name": "1",
  "age": 233
}

直接使用_id是可以搜到的,虽然还没到30s:

1
GET my_logs/_doc/1

search api不行:

1
GET my_logs/_search

同理,update by query因为也要用到query,所以其实是改不到这个文档的:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
POST my_logs/_update_by_query
{
  "query": {
    "term": {
      "name.keyword": {
        "value": "1"
      }
    }
  },
  "script": {
    "source": "ctx['name'] = '2'",
    "lang": "painless"
  }
}

但是使用_id则可以直接更新:

1
2
3
4
5
6
POST my_logs/_update/1
{
  "doc": {
    "name": "2"
  }
}

所以按照_id直接搜索、直接更新是可以的,和refresh interval无关。

_search_update_by_query都是受影响的。写入之后,如果没有refresh,update by qeury返回结果是0,找不到要更新的文件:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
{
  "took" : 1,
  "timed_out" : false,
  "total" : 0,
  "updated" : 0,
  "deleted" : 0,
  "batches" : 0,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

search api也可以使用_id field,这种search虽然用了_id,但是因为用的是search api,决定了它要去磁盘上找文档,所以自然也搜不到:

1
2
3
4
5
6
7
8
9
10
GET my_logs/_search
{
  "query": {
    "term": {
      "_id": {
        "value": "1"
      }
    }
  }
}

因为_id默认也被加到Lucene里了,为了方便搜索。

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-id-field.html

Each document has an _id that uniquely identifies it, which is indexed so that documents can be looked up either with the GET API or the ids query.

按field搜索:search api

search api虽然有refresh interval限制,但是它强啊!

search有url形式的api,可以直接通过参数传递搜索的条件。但随着搜索变得复杂,还是使用body传搜索条件更清晰。

term - 完整匹配

term用于搜索keyword类型。在Elasticsearch:basic中介绍过,keyword类型的field是囫囵存储,不会把字符串内容拆分为token。

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html

term有一个常见误区:如果对text type的field使用term查询,往往搜不到文档。比如存储的是John,使用term John搜索搜不到这条文档,因为text默认会把John处理为小写形式,所以term john才能搜到这条文档。所以对text类型,不要用term!使用match!

terms - mysql in

term是拿文档和一个值进行匹配,terms是和一堆值进行匹配,匹配上其中一个就可以。类似于mysql的in。

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html

它也有minimum_should_match,和bool的should一样,指定文档和目标值们至少匹配的几个才算匹配。

match - 多词查询

match用于搜索text类型,做的是token上的匹配。text会把存储的字符串拆分为token做倒排索引,match会对搜索字符串做同样的拆分,拿其中的每一个token和已存储的倒排索引做匹配。

match各个token之间是“或”的关系:只要有一个token和文档匹配就算匹配。但是也可以提高精度:

  • operator选择and,就是“与”的关系
  • minimum_should_match:介于“或”和“与”之间,可以设置数字(匹配词个数)或者百分比,比如有75%的词匹配就行(如果75%的词不是整数个,es会取整);

Ref:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/multi-match-query.html

match_phrase - 考虑token位置

短语匹配,会比match匹配更精准。文档里必须有这个短语才行。短语匹配不仅使用token,还使用token的位置:比如“hello pikachu”,不仅要匹配到hello和pikachu,pikachu的position还必须比hello大1,才算匹配。

  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/phrase-matching.html#_%E4%BB%80%E4%B9%88%E6%98%AF%E7%9F%AD%E8%AF%AD

match phrase很严格,有可能导致匹配到的文档条目过少。可以设置slop属性,允许短语之间的单词出现一定的间隔、错位:

  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/slop.html

最终的排序结果,仍然是离得近的排序更靠前。

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html
  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/phrase-matching.html

match + match_phrase - 混合起来搜索

match太松,match_phrase太严,把他们俩混合起来,打分叠加,查到的内容可能会更符合期待:

  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/proximity-relevance.html

shingle - 介于match和match_phrase之间

见后文。

bool - 组合搜索

bool是多种搜索的组合:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

支持与或非,

  • must:与;
  • should:或;
  • must_not:非;
  • filter:也是与。但是该query不计入score,所以叫“过滤”。由于不需要计算分数,所以filter比match query快

每一个与或非内,也是一个query的term或者match:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
POST _search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "user.id" : "kimchy" }
      },
      "filter": {
        "term" : { "tags" : "production" }
      },
      "must_not" : {
        "range" : {
          "age" : { "gte" : 10, "lte" : 20 }
        }
      },
      "should" : [
        { "term" : { "tags" : "env1" } },
        { "term" : { "tags" : "deployed" } }
      ],
      "minimum_should_match" : 1,
      "boost" : 1.0
    }
  }
}

和match(token之间的或)一样,bool也支持minimum_should_match,代表或语句(should)至少要命中几个

minimum_should_match的默认值是不定的:If the bool query includes at least one should clause and no must or filter clauses, the default value is 1. Otherwise, the default value is 0.

任何一个查询语句都可以用boost属性给打分加权

  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/_boosting_query_clauses.html

filter缓存

一个filter在该Lucene上的过滤结果可以通过二进制集合(类似Java中的bitset)表示,因为Lucene是不可变的,这个集合就可以被缓存下来,留待后续使用

multi_match - 一种特殊的bool

如果对不同的field执行相同的搜索,可以直接用multi_match,而不必“使用多个match,再用bool组合起来”:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html
  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/multi-match-query.html

exists - 值不为空

  • The field in the source JSON is null or []
  • The field has “index” : false set in the mapping
  • The length of the field value exceeded an ignore_above setting in the mapping
  • The field value was malformed and ignore_malformed was defined in the mapping

这样的文档都不会被exists query搜索到。

null和空对es来说没啥区别,最终结果都是不加入索引

  • https://www.elastic.co/guide/en/elasticsearch/guide/current/_dealing_with_null_values.html

查询索引里有没有,相当于 where field is not not:

1
2
3
4
5
6
7
8
GET /index/_count
{
  "query": {
    "exists": {
      "field": "updateTime"
    }
  }
}

not exists

“不存在”,missing已经deprecate了,用must_not + exists

1
2
3
4
5
6
7
8
9
10
11
12
13
14
GET /index/_count
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "updateTime"
          }
        }
      ]
    }
  }
}

eg:把字段更新为null:

1
2
3
4
5
6
POST /puzzle/_update/_bCvS30BsRxrU7Dv8KYG
{
  "doc": {
    "article": null
  }
}

查询:

1
2
3
4
5
6
7
8
9
10
GET /puzzle/_search
{
  "query": {
    "term": {
      "_id": {
        "value": "_bCvS30BsRxrU7Dv8KYG"
      }
    }
  }
}

结果:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "puzzle",
        "_type" : "_doc",
        "_id" : "_bCvS30BsRxrU7Dv8KYG",
        "_score" : 1.0,
        "_source" : {
          "article" : null
        }
      }
    ]
  }
}

用 must_not + exists 也可以查出来:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
GET /puzzle/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "article"
          }
        }
      ]
    }
  }
}
  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html

range - 范围匹配

对数量的range query比较简单,主要是对time的range query比较复杂一些:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
GET /index/_search
{
  "query": {
    "range": {
      "updateTime": {
        "gte": "2021-08-26",
        "lt": "2021-08-27",
        "format": "yyyy-MM-dd",
        "time_zone": "+08:00"
      }
    }
  },
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    }
  ]
}
1
2
3
4
5
6
7
8
9
10
11
12
GET /index/_count
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-7d/d",
        "lt": "now/d",
        "time_zone": "+08:00"
      }
    }
  }
}

之前说了,es里的时间本质上是string或者long epoch milli。在时间上,range搜索时支持不同时区时间自动转换,所以写起来稍微麻烦一些。

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html

fuzzy - 自动纠错

即使输入词某些字母输错了,依旧可以匹配上正确的文档!

To find similar terms, the fuzzy query creates a set of all possible variations, or expansions, of the search term within a specified edit distance. The query then returns exact matches for each expansion.

所以原本一个token会按照Levenshtein distance产生多个可能错误的token,然后分别和这些错误的token匹配。

大概需要匹配的索引翻了好几倍,所以如果search.allow_expensive_queries=false,不会执行。

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html

返回随机结果

假设指定一个条件,想随机返回满足条件的任意一个文档,可以使用function_scorerandom_score,同时使用filter做条件过滤,不考虑文档得分。但是注意一定要修改boost_mode,默认用的是multiply,但因为有0分存在,random_score乘以0没什么用了。可以换成sum,最好使用replace

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
GET witake_media/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": [
            {
              "term": {
                "platform": {
                  "value": "TikTok"
                }
              }
            },
            {
              "match": {
                "description.text": "prank"
              }
            }
          ]
        }
      },
      "random_score": {},
      "boost_mode": "replace"
    }
  },
  "size": 1
}

replace: only function score is used, the query score is ignored

  • https://stackoverflow.com/questions/9796470/random-order-pagination-elasticsearch

其他不太会用到的搜索

query_string - 搜索所有field

默认查询文档的所有field(但不包括nested文档),除非指定了field。甚至可以在搜索词里指定AND/OR进行多条件查询。

示例:name:nosql AND -description:mongodb,搜索name里包含nosql,且description不包含mongodb的文档。

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html

两个显著缺点:

  1. 不好阅读,不易维护和扩展;
  2. 查询过于强大,如果直接接收用户输入作为搜索词,用户查询的权限就比较大;

match_phrase_prefix - 短语前缀搜索

和match_phrase类似,不过最后一个词用前缀。比如“hello wor”可以匹配“hello world”,也可以匹配“hello word”。

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html

因为前面的词可以用倒排索引搜索,最后一个词要做前缀判断,所以会比match/match_phrase慢。可以使用max_expansions指定最多匹配多少个,以缩短返回时间。默认是50个。

prefix - 前缀搜索

前缀查询。默认不分析搜索词,所以大小写影响搜索结果。可以设置case_insensitive属性为true。

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-prefix-query.html

但是它实在不够高效。所以如果search.allow_expensive_queries=false,不会执行prefix query

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-prefix-query.html#prefix-query-allow-expensive-queries

index_prefix - 前缀索引搜索

定义mapping的时候,给field做前缀索引,这样的话再用前缀搜,就可以和前缀索引匹配,就快了。前缀的最小最大长度可以指定一下。

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/index-prefixes.html

和edge n-gram很像。

wildcard - 通配符搜索

越来越离谱了……可以做通配符匹配搜索。慢是肯定的,而且因为通配符可以出现在任何地方,所以不像index_prefix还能拯救一下prefix,没有人能拯救wildcard搜索的速度。所以如果search.allow_expensive_queries=false,不会执行wildcard query。

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html

搜索选项

_source

控制返回的_source字段。默认返回所有。如果不需要太多字段,只返回指定的field可以节省网络开销。甚至还支持通配符匹配field、include/exclude某些field:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html

倒排索引 - 空间换时间

es的搜索快,主要是因为提前建好了倒排索引,以空间换时间:

  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/inverted-index.html

遍历是O(n)操作,而倒排查询是O(1)操作。

es举了个倒排索引的例子——

假设有两个文档:

  1. The quick brown fox jumped over the lazy dog
  2. Quick brown foxes leap over lazy dogs in summer

按空格拆分后,构建出来的倒排索引就是这样的:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

如果搜索quick brown,搜索到的结果:

1
2
3
4
5
6
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1

analyzer

Elasticsearch:analyzer

highlight

highlight能给搜索结果高亮,方便发现为什么和文档匹配了。尤其是在变形词、带slop的match phrase匹配的场合:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html

highlight需要指定高亮哪些field,一般是用于搜索的field:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
GET /<index>/_search
{
  "query": {
    "match": {
      "description.reb_eng": "pikachu"
    }
  },
  "highlight": {
    "fields": {
      "description.reb_eng": {}
    }
  },
  "size": 1
}

结果:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
{
  "took" : 448,
  "timed_out" : false,
  "_shards" : {
    "total" : 15,
    "successful" : 15,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 18.429892,
    "hits" : [
      {
        "_index" : "witake_media_v6",
        "_type" : "_doc",
        "_id" : "346980-oFcT1jqQGA4",
        "_score" : 18.429892,
        "_routing" : "346980",
        "_source" : {
          "description" : """PIKACHU GAMING KD AND STATISTICS REVEALED | VEGA PIKACHU PUBG ID|BEST PUBG MOBILE PLAYER OF PAKISTAN
VEGA PIKACHU PUBG ID
VEGA PIKACHU CLUTCHES
BABLU PIKACHU CLUTCHES
TEAM BABLU PAKISTAN
PMCO PAKISTAN
VEGA PIKACHU best player of Pakistan
VEGA pickachu tiktok
Vegapikachu 1vs4
Teen Wolf gaming
FS black
FS baba
Freestyle PMWL
Team Vega BABLU
Vega tyrnat pubg ID
Vega Dani pubg ID
FS KASHOOF
FS Malik
pikachu gaming pubg,
pikachu gaming handcam,
pikachu gaming live,
pikachu gaming yt,
pikachu gaming tdm,
pikachu gaming face reveal,
pikachu gaming raid,
pikachu gaming control,
pikachu gaming axom,
bablo pikachu gaming,
bablu pikachu gaming,
bia pikachu gaming,
black pikachu gaming,
pikachu gaming channel,
pikachu gaming chair,
pikachu gaming cod,
top 10 gaming pikachu creepypasta,
pikachu cookie gaming,
pikachu chibi gaming,
pikachu gaming device,
dr pikachu gaming,
dnm pikachu gaming,
dac pikachu gaming,
pikachu gaming ff,
pika pikachu gaming face reveal,
false swipe gaming pikachu,
pikachu gaming god level sniping in tdm,
gaming pikachu one game,
glich'e gaming pikachu,
pikachu gaming hd,
hyper pikachu gaming,
headshot montage pikachu gaming,
pikachu gaming intro,
pikachu gaming id,
gaming with kev pikachu,
king pikachu gaming,
pikachu k20 gaming,
pikachu gaming logo,
pika pikachu gaming live,
pikachu gaming pubg lite,
luigikid gaming pikachu,
pikachu gaming monster legends,
pikachu gaming montage,
pikachu gaming movement,
pikachu gaming mamba raid,
pikachu gaming music,
pikachu gaming mix,
mr pikachu gaming,
master pikachu gaming,
pikachu noob gaming,
pikachu gaming odisha,
op pikachu gaming,
pikachu gaming tamil official,
pikachu gaming pmco,
pikachu gaming pro,
pikachu gaming practice,
pika pikachu gaming pmco,
pika pikachu gaming sensitivity,
pikachu gaming random,
real pikachu gaming,
rp pikachu gaming,
pikachu gaming review,
pikachu gaming r,
pikachu gaming scrims,
pikachu gaming song,
pikachu gaming sensitivity,
pikachu gaming squad,
pikachu gaming setup,
star pikachu gaming,
super pikachu gaming,
pikachu gaming tamil,
the pikachu gaming,
team pikachu gaming,
pikachu gaming toto,
pikachu gaming tv,
vega pikachu gaming,
sky vs gaming pikachu,
viano gaming pikachu,
pikachu gaming vn,
gaming with pikachu,
pikachu gaming xt,
pikachu 01 gaming,
pikachu gaming 123,
pikachu gaming #1,
gaming music 2020 pikachu
vega pikachu,
vega pikachu handcam,
vega pikachu pmco,
vega pikachu control,
vega pikachu scrims,
vega pikachu vs legend sam,
vega pikachu tdm,
vega pikachu pubg id,
vega pikachu gaming,
vega pikachu competitive,
vega pikachu vs captain king,
vega pikachu face reveal,
vega pikachu id,
vega pikachu kd,
vega pikachu live stream,
vega pikachu montage,
vega pikachu pubg mobile,
vega pikachu pubg,
vega pikachu sensitivity,
vega pikachu tournament,
vega pikachu vs vega dani,
mega pikachu y,
mega pikachu x and y,
mega pikachu z"""
        },
        "highlight" : {
          "description.reb_eng" : [
            """pubg,
<em>pikachu</em> gaming handcam,
<em>pikachu</em> gaming live,
<em>pikachu</em> gaming yt,
<em>pikachu</em> gaming tdm,
<em>pikachu</em> gaming""",
            """bablu <em>pikachu</em> gaming,
bia <em>pikachu</em> gaming,
black <em>pikachu</em> gaming,
<em>pikachu</em> gaming channel,
<em>pikachu</em> gaming""",
            """,
<em>pikachu</em> gaming device,
dr <em>pikachu</em> gaming,
dnm <em>pikachu</em> gaming,
dac <em>pikachu</em> gaming,
<em>pikachu</em> gaming ff""",
            """montage <em>pikachu</em> gaming,
<em>pikachu</em> gaming intro,
<em>pikachu</em> gaming id,
gaming with kev <em>pikachu</em>,
king <em>pikachu</em>""",
            """<em>pikachu</em>,
viano gaming <em>pikachu</em>,
<em>pikachu</em> gaming vn,
gaming with <em>pikachu</em>,
<em>pikachu</em> gaming xt,
<em>pikachu</em> 01"""
          ]
        }
      }
    ]
  }
}

从结果可以看出,如果整个fileld的内容特别多,highlight不会给整个field highlight,而是返回一个数组,默认包含5段(fragment)带高亮的文字。每段文字包含一部分文本

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
        "highlight" : {
          "description.reb_eng" : [
            """pubg,
<em>pikachu</em> gaming handcam,
<em>pikachu</em> gaming live,
<em>pikachu</em> gaming yt,
<em>pikachu</em> gaming tdm,
<em>pikachu</em> gaming""",
            """bablu <em>pikachu</em> gaming,
bia <em>pikachu</em> gaming,
black <em>pikachu</em> gaming,
<em>pikachu</em> gaming channel,
<em>pikachu</em> gaming""",
            """,
<em>pikachu</em> gaming device,
dr <em>pikachu</em> gaming,
dnm <em>pikachu</em> gaming,
dac <em>pikachu</em> gaming,
<em>pikachu</em> gaming ff""",
            """montage <em>pikachu</em> gaming,
<em>pikachu</em> gaming intro,
<em>pikachu</em> gaming id,
gaming with kev <em>pikachu</em>,
king <em>pikachu</em>""",
            """<em>pikachu</em>,
viano gaming <em>pikachu</em>,
<em>pikachu</em> gaming vn,
gaming with <em>pikachu</em>,
<em>pikachu</em> gaming xt,
<em>pikachu</em> 01"""
          ]
        }

控制highlight行为的参数:

  • number_of_fragments:返回几小段。默认是5;
  • fragment_size:每一小段的长度,默认100个字符。会在超出100的单词的边界处截断。所以实际一小段可能会比100大点儿;

怎么确定一小段(fragment)的起点和终点?大致可以理解为:

  1. 找到第一个匹配的token,算是fragment的起点;
  2. 继续往下数,直到加上某个token的长度之后,总长度超出了fragment_size

因此,fragment_size界定了一个fragment的范围,默认是100个字符。如果设置为20:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
GET /<index>/_search
{
  "query": {
    "match": {
      "description.reb_eng": "pikachu"
    }
  },
  "highlight": {
    "fragment_size" : 20,
    "fields": {
      "description.reb_eng": {}
    }
  },
  "size": 1
}

返回的5小段,每一小段都更短了:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
        "highlight" : {
          "description.reb_eng" : [
            "<em>PIKACHU</em> GAMING KD AND",
            """PUBG ID
VEGA <em>PIKACHU</em>""",
            """,
<em>pikachu</em> gaming r,
<em>pikachu</em>""",
            """<em>pikachu</em> gaming xt,
<em>pikachu</em>""",
            """vega <em>pikachu</em>,
vega <em>pikachu</em>"""
          ]
        }

如果想返回整个field的内容(并高亮),需要设置number_of_fragments=0,此时返回结果不再是n小段,而是一整个field,fragment_size的值将会被忽略:

The maximum number of fragments to return. If the number of fragments is set to 0, no fragments are returned. Instead, the entire field contents are highlighted and returned. This can be handy when you need to highlight short texts such as a title or address, but fragmentation is not required. If number_of_fragments is 0, fragment_size is ignored.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
GET /<index>/_search
{
  "query": {
    "match": {
      "description.reb_eng": "pikachu"
    }
  },
  "highlight": {
    "number_of_fragments": 0, 
    "fields": {
      "description.reb_eng": {}
    }
  },
  "size": 1
}

此时返回的是整个field的高亮内容,作为一整段:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
        "highlight" : {
          "description.reb_eng" : [
            """<em>PIKACHU</em> GAMING KD AND STATISTICS REVEALED | VEGA <em>PIKACHU</em> PUBG ID|BEST PUBG MOBILE PLAYER OF PAKISTAN
VEGA <em>PIKACHU</em> PUBG ID
VEGA <em>PIKACHU</em> CLUTCHES
BABLU <em>PIKACHU</em> CLUTCHES
TEAM BABLU PAKISTAN
PMCO PAKISTAN
VEGA <em>PIKACHU</em> best player of Pakistan
VEGA pickachu tiktok
Vegapikachu 1vs4
Teen Wolf gaming
FS black
FS baba
Freestyle PMWL
Team Vega BABLU
Vega tyrnat pubg ID
Vega Dani pubg ID
FS KASHOOF
FS Malik
<em>pikachu</em> gaming pubg,
<em>pikachu</em> gaming handcam,
<em>pikachu</em> gaming live,
<em>pikachu</em> gaming yt,
<em>pikachu</em> gaming tdm,
<em>pikachu</em> gaming face reveal,
<em>pikachu</em> gaming raid,
<em>pikachu</em> gaming control,
<em>pikachu</em> gaming axom,
bablo <em>pikachu</em> gaming,
bablu <em>pikachu</em> gaming,
bia <em>pikachu</em> gaming,
black <em>pikachu</em> gaming,
<em>pikachu</em> gaming channel,
<em>pikachu</em> gaming chair,
<em>pikachu</em> gaming cod,
top 10 gaming <em>pikachu</em> creepypasta,
<em>pikachu</em> cookie gaming,
<em>pikachu</em> chibi gaming,
<em>pikachu</em> gaming device,
dr <em>pikachu</em> gaming,
dnm <em>pikachu</em> gaming,
dac <em>pikachu</em> gaming,
<em>pikachu</em> gaming ff,
pika <em>pikachu</em> gaming face reveal,
false swipe gaming <em>pikachu</em>,
<em>pikachu</em> gaming god level sniping in tdm,
gaming <em>pikachu</em> one game,
glich'e gaming <em>pikachu</em>,
<em>pikachu</em> gaming hd,
hyper <em>pikachu</em> gaming,
headshot montage <em>pikachu</em> gaming,
<em>pikachu</em> gaming intro,
<em>pikachu</em> gaming id,
gaming with kev <em>pikachu</em>,
king <em>pikachu</em> gaming,
<em>pikachu</em> k20 gaming,
<em>pikachu</em> gaming logo,
pika <em>pikachu</em> gaming live,
<em>pikachu</em> gaming pubg lite,
luigikid gaming <em>pikachu</em>,
<em>pikachu</em> gaming monster legends,
<em>pikachu</em> gaming montage,
<em>pikachu</em> gaming movement,
<em>pikachu</em> gaming mamba raid,
<em>pikachu</em> gaming music,
<em>pikachu</em> gaming mix,
mr <em>pikachu</em> gaming,
master <em>pikachu</em> gaming,
<em>pikachu</em> noob gaming,
<em>pikachu</em> gaming odisha,
op <em>pikachu</em> gaming,
<em>pikachu</em> gaming tamil official,
<em>pikachu</em> gaming pmco,
<em>pikachu</em> gaming pro,
<em>pikachu</em> gaming practice,
pika <em>pikachu</em> gaming pmco,
pika <em>pikachu</em> gaming sensitivity,
<em>pikachu</em> gaming random,
real <em>pikachu</em> gaming,
rp <em>pikachu</em> gaming,
<em>pikachu</em> gaming review,
<em>pikachu</em> gaming r,
<em>pikachu</em> gaming scrims,
<em>pikachu</em> gaming song,
<em>pikachu</em> gaming sensitivity,
<em>pikachu</em> gaming squad,
<em>pikachu</em> gaming setup,
star <em>pikachu</em> gaming,
super <em>pikachu</em> gaming,
<em>pikachu</em> gaming tamil,
the <em>pikachu</em> gaming,
team <em>pikachu</em> gaming,
<em>pikachu</em> gaming toto,
<em>pikachu</em> gaming tv,
vega <em>pikachu</em> gaming,
sky vs gaming <em>pikachu</em>,
viano gaming <em>pikachu</em>,
<em>pikachu</em> gaming vn,
gaming with <em>pikachu</em>,
<em>pikachu</em> gaming xt,
<em>pikachu</em> 01 gaming,
<em>pikachu</em> gaming 123,
<em>pikachu</em> gaming #1,
gaming music 2020 <em>pikachu</em>
vega <em>pikachu</em>,
vega <em>pikachu</em> handcam,
vega <em>pikachu</em> pmco,
vega <em>pikachu</em> control,
vega <em>pikachu</em> scrims,
vega <em>pikachu</em> vs legend sam,
vega <em>pikachu</em> tdm,
vega <em>pikachu</em> pubg id,
vega <em>pikachu</em> gaming,
vega <em>pikachu</em> competitive,
vega <em>pikachu</em> vs captain king,
vega <em>pikachu</em> face reveal,
vega <em>pikachu</em> id,
vega <em>pikachu</em> kd,
vega <em>pikachu</em> live stream,
vega <em>pikachu</em> montage,
vega <em>pikachu</em> pubg mobile,
vega <em>pikachu</em> pubg,
vega <em>pikachu</em> sensitivity,
vega <em>pikachu</em> tournament,
vega <em>pikachu</em> vs vega dani,
mega <em>pikachu</em> y,
mega <em>pikachu</em> x and y,
mega <em>pikachu</em> z"""
          ]
        }

一个比较极端的例子:使用“pikachu”搜索“【去吧!】pikachu go”,如果按照默认行为(只返回匹配到的分片,而不是返回整段),前面的“【去吧!】”不会返回,只会返回“<em>pikachu</em> go”,应该是中文字符被当做句子分界了(分界默认用的是java的BreakIterator):

Break highlighted fragments at the next sentence boundary, as determined by Java’s BreakIterator.

如果想返回完整的“【去吧!】<em>pikachu</em> go”,就要设置number_of_fragments=0

highlight默认高亮的是search query搜索的内容,不过elasticsearch还支持传入自定义的highlight_query,它可以和search query无关,它所搜索的内容就是最终所高亮的内容。参考这个场景

highlight_query用的应该不多。

本文由作者按照 CC BY 4.0 进行授权