Elasticsearch:analyzer
For Elasticsearch search, merely building an inverted index is not enough. For example, if a search should ignore case, a bare inverted index cannot do that; the data has to be processed before it is indexed, converting everything to lowercase. The component that does this processing is the analyzer.
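For instance (a minimal illustration using the analyze API, which is described in detail later in this post), the standard analyzer already lowercases terms before they ever reach the inverted index:
POST _analyze
{
  "analyzer": "standard",
  "text": "Hello World"
}
This returns the tokens hello and world, which is why a search for "hello" can match a document that originally contained "Hello".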
analyzer
To process text before the inverted index is built, we use an analyzer to define how the text is handled. An analyzer consists of three parts:
- char_filter: character filters. Before the string is tokenized, they filter its characters, e.g. replacing & with and, or stripping HTML tags;
- tokenizer: the tokenizer. It processes the string and splits it into tokens;
- filter: token filters (token filter would feel like a more fitting name; presumably es figures it is already dealing with tokens, so there is no need to say token again). They modify tokens, and can also add or remove them, e.g. lowercasing tokens, reducing them to stems, or adding synonyms.
Examples are the commonly used standard analyzer and language-specific analyzers such as the english analyzer. Third-party ones can be added as well, e.g. the ik analyzer for Chinese segmentation.
Below, the standard analyzer and the english analyzer are used as examples to introduce char_filter / tokenizer / filter in turn:
char_filter
es does not ship many built-in char filters, and they are not used all that often:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-charfilters.html
For example, the HTML Strip Character Filter.
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html
- https://www.elastic.co/guide/cn/elasticsearch/guide/current/char-filters.html
Neither the standard analyzer nor the english analyzer uses a char filter.
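For instance, the html_strip char filter can be tried on its own with the analyze API (introduced below):
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
The char filter removes the tags and decodes the entity before the standard tokenizer runs, so the result is roughly the tokens I'm, so, happy.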
tokenizer
The built-in tokenizers:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
Both the standard analyzer and the english analyzer use the standard tokenizer. For English it roughly splits on whitespace and punctuation; Chinese, however, is split into one token per character, losing the semantics:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-tokenizer.html
With the analyze API, es makes it very easy to test an analyzer or its individual components:
POST _analyze
{
"tokenizer": "standard",
"text": "hello123 456 world 你好吗"
}
Output:
{
"tokens" : [
{
"token" : "hello123",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "456",
"start_offset" : 9,
"end_offset" : 12,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "world",
"start_offset" : 13,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "你",
"start_offset" : 19,
"end_offset" : 20,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "好",
"start_offset" : 20,
"end_offset" : 21,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "吗",
"start_offset" : 21,
"end_offset" : 22,
"type" : "<IDEOGRAPHIC>",
"position" : 5
}
]
}
The standard tokenizer actually does quite a lot: it processes text according to the unified text segmentation algorithm defined in Unicode Standard Annex #29, which includes dropping punctuation. Its handling of Chinese and similar languages is poor, though; if the semantics of those languages matter, use a more specialized tokenizer such as icu or ik.
Also note that, besides the token text itself, the emitted tokens carry position and type information; match_phrase queries use position.
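As a hedged sketch of how position gets used (the index and field names here are made up), a match_phrase query only matches when the query tokens occur at adjacent positions in the document:
GET /my-index/_search
{
  "query": {
    "match_phrase": {
      "content": "hello world"
    }
  }
}
This matches documents in which hello sits at some position N and world at N+1 (unless slop is increased).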
max_token_length: defaults to 255. The standard tokenizer limits token length; anything longer than 255 characters is split at the 255-character mark.
Note: what we usually call icu and ik are, strictly speaking, tokenizers. Both also ship analyzers built on those tokenizers, e.g. ik_smart.
- https://www.elastic.co/guide/cn/elasticsearch/guide/current/icu-tokenizer.html
pattern tokenizer
Specify your own pattern to split on:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html
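A small sketch (the default pattern is \W+; here a comma is used instead):
POST _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": ","
  },
  "text": "comma,separated,values"
}
This yields the tokens comma, separated and values.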
UAX URL email tokenizer
Tokenizes emails and URLs:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-uaxurlemail-tokenizer.html
Note the default maximum token length is 255, which a URL may exceed.
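A quick test (the example text is my own):
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "email me at john.smith@example.com or visit https://www.elastic.co/guide"
}
Unlike the standard tokenizer, which would split the address and the URL into pieces, this keeps john.smith@example.com and https://www.elastic.co/guide as single tokens.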
path hierarchy tokenizer
The path and every parent directory of the path become tokens:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html
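For example:
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/bin"
}
This produces /usr, /usr/local and /usr/local/bin as tokens.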
filter
There are a lot more token filters, since there are plenty of things one wants to do to tokens:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html
For example, the lowercase filter:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lowercase-tokenfilter.html
Test it with the analyze API, specifying a tokenizer and a filter:
GET _analyze
{
"tokenizer" : "standard",
"filter" : ["lowercase"],
"text" : "THE Quick FoX JUMPs"
}
Output:
[ the, quick, fox, jumps ]
The stop word filter removes a language's stop words, meaningless words such as a/an/the:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html
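A quick test with the default (English) stop word list:
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["stop"],
  "text": "a quick fox jumps over the lazy dog"
}
a and the are dropped, leaving quick, fox, jumps, over, lazy, dog.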
Like tokenizers, filters have properties you can customize:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html#analysis-stop-tokenfilter-configure-parms
The standard analyzer uses the lowercase filter; the english analyzer uses quite a few filters:
PUT /english_example
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"rebuilt_english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
}
}
}
The first is a stop word filter: when you don't want certain tokens, it filters them out. The stop word set configured here is es's built-in English stop words, _english_:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html#english-stop-words
The last two filters are stemmer filters, which reduce tokens to their root form and can greatly widen the range of matches:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html
The second is the keyword_marker filter, used when you do not want a particular word to be stemmed:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html
reverse token filter
Reverses each token:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html
It looks useless at first, but if you need to match *bar, it is far better to reverse the tokens and match rab* instead; the speedup is enormous.
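The filter itself is easy to try:
GET _analyze
{
  "tokenizer": "keyword",
  "filter": ["reverse"],
  "text": "foobar"
}
It returns the single token raboof; with a field indexed this way, a leading-wildcard query such as *bar can be rewritten as the much cheaper prefix-style rab*.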
There is another neat trick: reverse + edge_ngram + reverse:
GET /_analyze
{
"tokenizer": "standard",
"filter": ["reverse", "edge_ngram", "reverse"],
"text" : "Hello!"
}
This generates edge n-grams from the end of the token; it has the same effect as setting side=back on edge_ngram.
Custom analyzers
If none of the built-in analyzers fits your needs, you have to define a custom analyzer.
Changing any single setting of a standard analyzer already means customizing an analyzer. In most cases a custom analyzer simply means:
- tweaking the configuration of a built-in char_filter/tokenizer/filter to produce a new char_filter/tokenizer/filter;
- assembling one or more char_filter/tokenizer/filter components;
The english analyzer shown above is really just a tokenizer and filters with adjusted settings, assembled together.
For example, a custom standard analyzer whose maximum token length is 5:
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "standard",
"max_token_length": 5
}
}
}
}
}
You can also define a custom char filter that converts & to and:
"char_filter": {
"&_to_and": {
"type": "mapping",
"mappings": [ "&=> and "]
}
}
This is really just the mapping char_filter with its configuration tweaked.
- https://www.elastic.co/guide/cn/elasticsearch/guide/current/custom-analyzers.html
Finally, put these char_filter/tokenizer/filter components into an analyzer of type custom and assemble your own analyzer (a minimal sketch follows the link below):
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html
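A minimal sketch of such an assembly (the index name and analyzer name here are made up for illustration):
PUT /my-custom-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [ "&=> and " ]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "&_to_and" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}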
The minimal set of analyzer components
Although an analyzer consists of three parts, there may be any number of char_filter and filter entries (zero or more), but there must be exactly one tokenizer.
The most typical example is the whitespace analyzer, which consists of nothing but a single tokenizer, the whitespace tokenizer, splitting on whitespace.
_analyze API
It tests analyzers, and can also test char_filter, tokenizer, and filter components:
GET /_analyze
{
"tokenizer" : "keyword",
"filter" : ["lowercase"],
"char_filter" : ["html_strip"],
"text" : "this is a <b>test</b>"
}
Or define an inline custom truncate filter:
GET _analyze
{
"tokenizer": "keyword",
"filter": [
{
"type": "truncate",
"length": 7
},
"lowercase"
],
"text": [
"Hello World"
]
}
Returns:
{
"tokens" : [
{
"token" : "hello w",
"start_offset" : 0,
"end_offset" : 11,
"type" : "word",
"position" : 0
}
]
}
A rather convenient feature is referencing the analyzer of a field of an existing index:
GET /<index>/_analyze
{
"field": "description",
"text": ["hello world"]
}
Or specify the analyzer by name directly; either an index analyzer or a search_analyzer works:
GET <index>/_analyze
{
"analyzer": "autocomplete_sentence_search",
"text": ["hello world"]
}
search as you type
Some tokenizers/filters are powerful enough to deserve their own discussion. They are commonly used to build search-as-you-type.
n-gram tokenizer/filter - a sliding window over characters
A token filter can modify tokens, and it can also add or remove them. The stop word filter is an example of removing tokens; n-gram is an example of adding tokens.
The n-gram tokenizer and the n-gram filter are introduced together:
- the n-gram tokenizer tokenizes the text, producing n-grams;
- the n-gram filter produces n-grams for each token that some other tokenizer has already produced;
The tokenizer:
POST _analyze
{
"tokenizer": "ngram",
"text": "Quick Fox"
}
Produces:
[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]
The token_chars property configures which character classes are kept in tokens. It defaults to an empty array, i.e. all characters are kept and the whole text is treated like one keyword before n-gramming. min_gram/max_gram default to 1/2.
A customized tokenizer:
GET _analyze
{
"tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 3
},
"text": "hello world"
}
It produces a whole pile of tokens:
{
"tokens" : [
{
"token" : "hel",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "hell",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 1
},
{
"token" : "ell",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 2
},
{
"token" : "ello",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 3
},
{
"token" : "llo",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 4
},
{
"token" : "llo ",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 5
},
{
"token" : "lo ",
"start_offset" : 3,
"end_offset" : 6,
"type" : "word",
"position" : 6
},
{
"token" : "lo w",
"start_offset" : 3,
"end_offset" : 7,
"type" : "word",
"position" : 7
},
{
"token" : "o w",
"start_offset" : 4,
"end_offset" : 7,
"type" : "word",
"position" : 8
},
{
"token" : "o wo",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 9
},
{
"token" : " wo",
"start_offset" : 5,
"end_offset" : 8,
"type" : "word",
"position" : 10
},
{
"token" : " wor",
"start_offset" : 5,
"end_offset" : 9,
"type" : "word",
"position" : 11
},
{
"token" : "wor",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 12
},
{
"token" : "worl",
"start_offset" : 6,
"end_offset" : 10,
"type" : "word",
"position" : 13
},
{
"token" : "orl",
"start_offset" : 7,
"end_offset" : 10,
"type" : "word",
"position" : 14
},
{
"token" : "orld",
"start_offset" : 7,
"end_offset" : 11,
"type" : "word",
"position" : 15
},
{
"token" : "rld",
"start_offset" : 8,
"end_offset" : 11,
"type" : "word",
"position" : 16
}
]
}
Because n-gram produces such an explosive number of tokens, the difference between max_gram and min_gram must not exceed 1 by default, otherwise an error is returned:
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "The difference between max_gram and min_gram in NGram Tokenizer must be less than or equal to: [1] but was [5]. This limit can be set by changing the [index.max_ngram_diff] index level setting."
}
],
"type" : "illegal_argument_exception",
"reason" : "The difference between max_gram and min_gram in NGram Tokenizer must be less than or equal to: [1] but was [5]. This limit can be set by changing the [index.max_ngram_diff] index level setting."
},
"status" : 400
}
Even so, it still produces too many tokens, and the starting positions of n-gram tokens are fairly meaningless, so plain n-gram is rarely used.
The n-gram filter works the same way, except that, being only a filter, it has to be paired with a tokenizer:
GET _analyze
{
"tokenizer": "standard",
"filter": [
{
"type": "ngram",
"min_gram": 3,
"max_gram": 4
}
],
"text": "hello world"
}
The resulting tokens are a bit better than with the n-gram tokenizer, because the standard tokenizer first splits the string into two tokens, so each token yields fewer n-grams:
{
"tokens" : [
{
"token" : "hel",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "hell",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "ell",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "ello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "llo",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "wor",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "worl",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "orl",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "orld",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "rld",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
edge n-gram tokenizer/filter - a sliding window whose left edge is fixed
Compared with the n-gram tokenizer/filter, the left edge (the starting point) of the edge n-gram tokenizer/filter window is fixed at the edge of the token, so everything it produces is a prefix, which is far more meaningful.
Likewise, the edge n-gram tokenizer treats the whole string as one keyword and then splits it, starting from the edge, into many tokens:
GET _analyze
{
"tokenizer": {
"type": "edge_ngram",
"max_gram": 5
},
"text": "hello world"
}
Since it does not slide a window from every position the way n-gram does, it produces far fewer tokens, and so there is no limit requiring max and min to differ by at most 1:
{
"tokens" : [
{
"token" : "h",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "he",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "hel",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 2
},
{
"token" : "hell",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 3
},
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 4
}
]
}
The edge n-gram filter, likewise, has to be paired with a tokenizer:
GET _analyze
{
"tokenizer": "standard",
"filter": [
{
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 4
}
],
"text": "hello world"
}
Output:
{
"tokens" : [
{
"token" : "he",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "hel",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "hell",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "wo",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "wor",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "worl",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
The standard tokenizer splits the string into two tokens, hello and world; the edge n-gram filter then maps them to 6 tokens, each a prefix of 2 to 4 characters of the original token.
So as the user types, the search box only has to fire a request to es once more than two characters have been entered and display the results live below the box, and you get a search-as-you-type effect! (If the search responses cannot keep up with the user's typing, well, that's that…)
Because this feature is so useful, es lets you declare a field with the "search_as_you_type" type when defining the mapping:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html
It works out of the box; you don't even need to define a custom analyzer containing an edge n-gram filter.
When the user types "he wor", it matches both he and wor, and the document is returned as a candidate.
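A hedged sketch of how that field type is typically used (the index and field names are made up): declare the field as search_as_you_type, then query it with a multi_match of type bool_prefix across the automatically generated sub-fields:
PUT /articles
{
  "mappings": {
    "properties": {
      "title": { "type": "search_as_you_type" }
    }
  }
}

GET /articles/_search
{
  "query": {
    "multi_match": {
      "query": "he wor",
      "type": "bool_prefix",
      "fields": [ "title", "title._2gram", "title._3gram" ]
    }
  }
}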
Word prefixes
By default the edge n-gram tokenizer/filter treats the whole string as one keyword, so it can only produce prefixes of the entire sentence.
You can think of it as: edge n-gram tokenizer = keyword tokenizer + edge n-gram filter (a quick check of this equivalence follows below).
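A quick check of that equivalence with the analyze API:
GET _analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 5
    }
  ],
  "text": "hello world"
}
This yields the same h/he/hel/hell/hello tokens as the edge_ngram tokenizer example above (only the offsets differ, since a filter keeps the offsets of the underlying keyword token).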
If you want prefixes of every individual word, tokenize with the standard tokenizer first, then apply the edge n-gram filter:
GET _analyze
{
"tokenizer": "standard",
"filter": [
{
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 4
}
],
"text": "hello world"
}
Result:
{
"tokens" : [
{
"token" : "he",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "hel",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "hell",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "wo",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "wor",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "worl",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
You can also configure the edge n-gram tokenizer itself to treat words as the tokens and produce n-grams from them, without any filter:
GET _analyze
{
"tokenizer": {
"type": "edge_ngram",
"max_gram": 5,
"token_chars": ["letter"]
},
"text": "hello world"
}
Result:
{
"tokens" : [
{
"token" : "he",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "hel",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "hell",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
},
{
"token" : "wo",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 3
},
{
"token" : "wor",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 4
},
{
"token" : "worl",
"start_offset" : 6,
"end_offset" : 10,
"type" : "word",
"position" : 5
}
]
}
Both produce the same tokens, though the offset information differs a little; after all, one is a tokenizer and the other a filter.
What if you use the tokenizer and the filter together?
The n-gram filter and edge n-gram filter must be paired with a tokenizer (there must be exactly one tokenizer), so can that tokenizer be the n-gram or edge n-gram tokenizer? Of course it can:
GET _analyze
{
"tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4
},
"filter": [
{
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 4
}
],
"text": "hello world"
}
Not that doing so is particularly meaningful.
truncate token filter - when the query term is too long
The custom filter above generated only prefixes of length 2-4 as tokens. If the user then types hello or world in full, those prefixes no longer match, which looks absurd from the user's point of view.
The root cause is that the query term exceeds the edge n-gram's max_gram. In that case, add a truncate token filter to the analyzer; it automatically truncates the query term to max_gram characters, and matches come back again:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenfilter.html#analysis-edgengram-tokenfilter-max-gram-limits
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-truncate-tokenfilter.html
So for search-as-you-type, user input should be analyzed as keyword + truncation, while the stored data should be analyzed with edge n-gram:
"analysis":{
"filter":{
"sentence_char_trunc": {
"type": "truncate",
"length": 20
}
},
"analyzer":{
"autocomplete_sentence":{
"tokenizer":"sentence_edge_ngram",
"filter":[
"lowercase"
]
},
"autocomplete_sentence_search":{
"tokenizer":"keyword",
"filter":[
"lowercase",
"sentence_char_trunc"
]
}
},
"tokenizer":{
"sentence_edge_ngram":{
"type":"edge_ngram",
"min_gram":2,
"max_gram":20,
"token_chars":[
]
}
},
"char_filter":{
}
}
completion suggester - search as you type
If you need to match documents by prefixes in arbitrary order, use edge n-gram; if the text has a widely known order, use the completion suggester:
When you need search-as-you-type for text which has a widely known order, such as movie or song titles, the completion suggester is a much more efficient choice than edge N-grams. Edge N-grams have the advantage when trying to autocomplete words that can appear in any order.
But what exactly is a widely known order? TODO
It is also a field type, apparently, just as wild as "search_as_you_type"…
- https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html#completion-suggester
One big advantage is that it is heavily optimized for speed, to the point of being held entirely in memory:
Ideally, auto-complete functionality should be as fast as a user types to provide instant feedback relevant to what a user has already typed in. Hence, completion suggester is optimized for speed. The suggester uses data structures that enable fast lookups, but are costly to build and are stored in-memory.
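A minimal sketch of a completion field and a suggest query (the index and field names are made up):
PUT /songs
{
  "mappings": {
    "properties": {
      "title_suggest": { "type": "completion" }
    }
  }
}

POST /songs/_search
{
  "suggest": {
    "song-suggest": {
      "prefix": "hal",
      "completion": {
        "field": "title_suggest"
      }
    }
  }
}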
search analyzer - using it together with edge n-gram
Normally the analyzer applied to the query at search time should be the same as the one used at index time, so that the query tokens and the indexed tokens are matched under the same rules. For example, if you index with the english analyzer, search with the english analyzer too.
edge n-gram is different: the index-time analyzer is edge-n-gram based and stores prefixes. At search time that analyzer must not be applied to the query term, otherwise a document would match whenever any prefix of the query equals any indexed prefix:
Say the indexed word is pikachu, so the inverted index stores the prefixes pi/pik/pika. The query is pipi; it clearly should not match pikachu. But if edge n-gram were applied to it as well, the query would also become three terms, pi/pip/pipi, and the query term pi would match the indexed pi. That is not at all what we want.
analyzer and search_analyzer are both field properties set when defining the mapping:
- the former is called the index analyzer;
- the latter is called the search analyzer;
Define a field with several sub-fields:
"name":{
"type":"keyword",
"fields":{
"standard":{
"type":"text",
"analyzer":"standard"
},
"autocomplete":{
"type":"text",
"analyzer":"autocomplete_sentence",
"search_analyzer":"autocomplete_sentence_search"
}
}
}
The name.autocomplete field uses the custom autocomplete_sentence analyzer as its index analyzer and the custom autocomplete_sentence_search analyzer as its search analyzer.
The former uses edge n-gram; the latter is just a plain keyword tokenizer:
"analyzer":{
"autocomplete_sentence":{
"tokenizer":"sentence_edge_ngram",
"filter":[
"lowercase"
]
},
"autocomplete_sentence_search":{
"tokenizer":"keyword",
"filter":[
"lowercase"
]
}
},
"tokenizer":{
"sentence_edge_ngram":{
"type":"edge_ngram",
"min_gram":2,
"max_gram":20,
"token_chars":[
]
}
}
}
- https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
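With such a mapping in place, an ordinary match query on name.autocomplete behaves like a prefix search: the query string is analyzed by the keyword-based search analyzer and looked up against the edge n-gram prefixes stored in the index (a sketch, reusing the hypothetical field above):
GET /<index>/_search
{
  "query": {
    "match": {
      "name.autocomplete": "pika"
    }
  }
}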
shingle - a sliding window over tokens
A shingle is a token-level n-gram: the edge n-gram token filter builds n-grams from the prefix of a single token, while shingle builds n-grams across multiple tokens.
For example:
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type": "shingle",
"min_shingle_size": 3,
"max_shingle_size": 4
}
],
"text": "To beyond and halo infinite"
}
It produces five individual words plus five shingles:
{
"tokens" : [
{
"token" : "To",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "To beyond and",
"start_offset" : 0,
"end_offset" : 13,
"type" : "shingle",
"position" : 0,
"positionLength" : 3
},
{
"token" : "To beyond and halo",
"start_offset" : 0,
"end_offset" : 18,
"type" : "shingle",
"position" : 0,
"positionLength" : 4
},
{
"token" : "beyond",
"start_offset" : 3,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "beyond and halo",
"start_offset" : 3,
"end_offset" : 18,
"type" : "shingle",
"position" : 1,
"positionLength" : 3
},
{
"token" : "beyond and halo infinite",
"start_offset" : 3,
"end_offset" : 27,
"type" : "shingle",
"position" : 1,
"positionLength" : 4
},
{
"token" : "and",
"start_offset" : 10,
"end_offset" : 13,
"type" : "word",
"position" : 2
},
{
"token" : "and halo infinite",
"start_offset" : 10,
"end_offset" : 27,
"type" : "shingle",
"position" : 2,
"positionLength" : 3
},
{
"token" : "halo",
"start_offset" : 14,
"end_offset" : 18,
"type" : "word",
"position" : 3
},
{
"token" : "infinite",
"start_offset" : 19,
"end_offset" : 27,
"type" : "word",
"position" : 4
}
]
}
min is 3, yet single words still appear, because there is also an output_unigrams property, true by default, which outputs the original individual words.
For a field indexed with shingles, just use the same analyzer at search time; unlike edge n-gram there is no need for a separate search analyzer, because shingles are matched by whole words anyway, which fits the intuition of how search should behave.
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html
Setting up analyzers that include a shingle filter:
# analyzer
PUT /puzzle
{
"settings":{
"number_of_shards":1,
"number_of_replicas":1,
"analysis":{
"char_filter":{
"&_to_and":{
"type":"mapping",
"mappings":[
"&=> and "
]
}
},
"filter":{
"english_stop":{
"type":"stop",
"ignore_case":true,
"stopwords":[
"a",
"an",
"are",
"as",
"at",
"be",
"but",
"by",
"for",
"if",
"in",
"into",
"is",
"it",
"no",
"not",
"of",
"on",
"or",
"such",
"that",
"the",
"their",
"then",
"there",
"these",
"they",
"this",
"to",
"was",
"will",
"with"
]
},
"english_keywords":{
"type":"keyword_marker",
"keywords":[
"example"
]
},
"english_stemmer":{
"type":"stemmer",
"language":"english"
},
"english_possessive_stemmer":{
"type":"stemmer",
"language":"possessive_english"
},
"my_shingle_filter":{
"type":"shingle",
"min_shingle_size":2,
"max_shingle_size":2,
"output_unigrams":false
}
},
"analyzer":{
"reb_standard":{
"type":"custom",
"char_filter":[
"&_to_and"
],
"tokenizer":"standard"
},
"reb_english":{
"type":"custom",
"char_filter":[
"&_to_and"
],
"tokenizer":"standard",
"filter":[
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
},
"my_shingle_analyzer":{
"type":"custom",
"char_filter":[
"&_to_and"
],
"tokenizer":"standard",
"filter":[
"lowercase",
"my_shingle_filter"
]
},
"eng_shingle_analyzer":{
"type":"custom",
"char_filter":[
"&_to_and"
],
"tokenizer":"standard",
"filter":[
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer",
"my_shingle_filter"
]
}
}
}
}
}
Set the mapping:
PUT /puzzle/_mapping
{
"properties": {
"article": {
"type": "keyword",
"fields": {
"stan": {
"type": "text",
"analyzer": "standard"
},
"eng": {
"type": "text",
"analyzer": "english"
},
"reb_eng": {
"type": "text",
"analyzer": "reb_english"
},
"reb_stan": {
"type": "text",
"analyzer": "reb_standard"
},
"icu": {
"type": "text",
"analyzer": "icu_analyzer"
},
"ik": {
"type": "text",
"analyzer": "ik_smart"
}
}
}
}
}
PUT /puzzle/_mapping
{
"properties": {
"article": {
"type": "keyword",
"fields": {
"shingle": {
"type": "text",
"analyzer": "eng_shingle_analyzer"
}
}
}
}
}
Search:
GET /puzzle/_search
{
"query": {
"bool": {
"must": {
"match": {
"article.reb_eng": {
"query": "puzzles & survival",
"minimum_should_match": "100%"
}
}
},
"should": [
{
"match": {
"article.shingle": "puzzles & survival"
}
}
]
}
}
}
Shingles sit between match and match_phrase: like match, they don't require every query term to appear (match_phrase can relax slop, but every term must still appear), yet unlike match they are not completely order-blind (shingles capture local order, so documents whose local word order matches the shingles score higher):
- https://www.elastic.co/guide/cn/elasticsearch/guide/current/shingles.html
Examples: custom analyzers
Some typical examples of custom analyzers.
Keeping # / @
Requirement: be able to search for #hello.
By default es removes punctuation, so searching for #hello is the same as searching for hello:
GET /_analyze
{
"tokenizer": "standard",
"text": ["hello #world @wtf &emmm ??? ? . , !tanhao ! ! ……我爱你 才怪"]
}
Result:
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "world",
"start_offset" : 7,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "wtf",
"start_offset" : 14,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "emmm",
"start_offset" : 19,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "tanhao",
"start_offset" : 36,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "我",
"start_offset" : 49,
"end_offset" : 50,
"type" : "<IDEOGRAPHIC>",
"position" : 5
},
{
"token" : "爱",
"start_offset" : 50,
"end_offset" : 51,
"type" : "<IDEOGRAPHIC>",
"position" : 6
},
{
"token" : "你",
"start_offset" : 51,
"end_offset" : 52,
"type" : "<IDEOGRAPHIC>",
"position" : 7
},
{
"token" : "才",
"start_offset" : 53,
"end_offset" : 54,
"type" : "<IDEOGRAPHIC>",
"position" : 8
},
{
"token" : "怪",
"start_offset" : 54,
"end_offset" : 55,
"type" : "<IDEOGRAPHIC>",
"position" : 9
}
]
}
The component doing the work here is mainly the standard tokenizer inside the standard analyzer, which follows the Unicode Text Segmentation algorithm and drops the punctuation already during tokenization.
So filters are not the only thing that removes content; a tokenizer can also drop characters while producing tokens.
If you switch to whitespace tokenization instead, it just splits on spaces without detecting word boundaries, and therefore does not remove punctuation either:
GET /_analyze
{
"tokenizer": "whitespace",
"text": ["hello #world @wtf &emmm ??? ? . , !tanhao ! ! ……我爱你 才怪"]
}
Result:
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "#world",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 1
},
{
"token" : "@wtf",
"start_offset" : 13,
"end_offset" : 17,
"type" : "word",
"position" : 2
},
{
"token" : "&emmm",
"start_offset" : 18,
"end_offset" : 23,
"type" : "word",
"position" : 3
},
{
"token" : "???",
"start_offset" : 24,
"end_offset" : 27,
"type" : "word",
"position" : 4
},
{
"token" : "?",
"start_offset" : 29,
"end_offset" : 30,
"type" : "word",
"position" : 5
},
{
"token" : ".",
"start_offset" : 31,
"end_offset" : 32,
"type" : "word",
"position" : 6
},
{
"token" : ",",
"start_offset" : 33,
"end_offset" : 34,
"type" : "word",
"position" : 7
},
{
"token" : "!tanhao",
"start_offset" : 35,
"end_offset" : 42,
"type" : "word",
"position" : 8
},
{
"token" : "!",
"start_offset" : 43,
"end_offset" : 44,
"type" : "word",
"position" : 9
},
{
"token" : "!",
"start_offset" : 45,
"end_offset" : 46,
"type" : "word",
"position" : 10
},
{
"token" : "……我爱你",
"start_offset" : 47,
"end_offset" : 52,
"type" : "word",
"position" : 11
},
{
"token" : "才怪",
"start_offset" : 53,
"end_offset" : 55,
"type" : "word",
"position" : 12
}
]
}
Note that the reported token type is plain word, rather than something more specific like ALPHANUM.
Referring to this article: if you want to keep symbols such as # and @, you need the word delimiter graph token filter to preserve them:
"filter":{
"hashtag_as_alphanum" : {
"type" : "word_delimiter_graph",
"type_table": ["# => ALPHANUM", "@ => ALPHANUM"]
}
}
But filters take effect after the tokenizer, so the tokenizer has to be switched to whitespace; otherwise the tokenizer has already dropped the punctuation and there is nothing left for the filter to preserve.
word_delimiter_graph works as follows: the whitespace tokenizer splits only on spaces, leaving a lot of tokens; word_delimiter_graph can then strip the punctuation surrounding each token:
GET /_analyze
{
"tokenizer": "whitespace",
"filter": ["word_delimiter_graph"],
"text": ["hello #world @wtf &emmm ??? ? . , !tanhao ! ! ……我爱你 才怪 rainy"]
}
The tokens it produces are then close to what the standard tokenizer gives:
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "world",
"start_offset" : 7,
"end_offset" : 12,
"type" : "word",
"position" : 1
},
{
"token" : "wtf",
"start_offset" : 14,
"end_offset" : 17,
"type" : "word",
"position" : 2
},
{
"token" : "emmm",
"start_offset" : 19,
"end_offset" : 23,
"type" : "word",
"position" : 3
},
{
"token" : "tanhao",
"start_offset" : 36,
"end_offset" : 42,
"type" : "word",
"position" : 8
},
{
"token" : "我爱你",
"start_offset" : 49,
"end_offset" : 52,
"type" : "word",
"position" : 11
},
{
"token" : "才怪",
"start_offset" : 53,
"end_offset" : 55,
"type" : "word",
"position" : 12
},
{
"token" : "rainy",
"start_offset" : 56,
"end_offset" : 61,
"type" : "word",
"position" : 13
}
]
}
Now, if we tell it not to strip # and @, tokens like #hello can be kept!
The new analyzer:
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type": "word_delimiter_graph",
"type_table": [
"# => ALPHANUM",
"@ => ALPHANUM"
]
}
],
"text": ["hello #world @wtf &emmm# ??? ? . , !tanhao ! ! ……我爱你 才怪 rainy"]
}
Result:
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "#world",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 1
},
{
"token" : "@wtf",
"start_offset" : 13,
"end_offset" : 17,
"type" : "word",
"position" : 2
},
{
"token" : "emmm#",
"start_offset" : 19,
"end_offset" : 24,
"type" : "word",
"position" : 3
},
{
"token" : "tanhao",
"start_offset" : 37,
"end_offset" : 43,
"type" : "word",
"position" : 8
},
{
"token" : "我爱你",
"start_offset" : 50,
"end_offset" : 53,
"type" : "word",
"position" : 11
},
{
"token" : "才怪",
"start_offset" : 54,
"end_offset" : 56,
"type" : "word",
"position" : 12
},
{
"token" : "rainy",
"start_offset" : 57,
"end_offset" : 62,
"type" : "word",
"position" : 13
}
]
}
The token #world was successfully preserved. Of course, emmm# was preserved as well.
If you set "preserve_original": true:
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type": "word_delimiter_graph",
"type_table": [
"# => ALPHANUM",
"@ => ALPHANUM"
],
"preserve_original": true
}
],
"text": ["hello #world @wtf &emmm ??? ? . , !tanhao ! ! ……我爱你 才怪 rainy"]
}
then both the pre-filter tokens with punctuation and the post-filter tokens without it are kept:
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "#world",
"start_offset" : 6,
"end_offset" : 12,
"type" : "word",
"position" : 1
},
{
"token" : "@wtf",
"start_offset" : 13,
"end_offset" : 17,
"type" : "word",
"position" : 2
},
{
"token" : "&emmm",
"start_offset" : 18,
"end_offset" : 23,
"type" : "word",
"position" : 3
},
{
"token" : "emmm",
"start_offset" : 19,
"end_offset" : 23,
"type" : "word",
"position" : 3
},
{
"token" : "???",
"start_offset" : 24,
"end_offset" : 27,
"type" : "word",
"position" : 4
},
{
"token" : "?",
"start_offset" : 29,
"end_offset" : 30,
"type" : "word",
"position" : 5
},
{
"token" : ".",
"start_offset" : 31,
"end_offset" : 32,
"type" : "word",
"position" : 6
},
{
"token" : ",",
"start_offset" : 33,
"end_offset" : 34,
"type" : "word",
"position" : 7
},
{
"token" : "!tanhao",
"start_offset" : 35,
"end_offset" : 42,
"type" : "word",
"position" : 8
},
{
"token" : "tanhao",
"start_offset" : 36,
"end_offset" : 42,
"type" : "word",
"position" : 8
},
{
"token" : "!",
"start_offset" : 43,
"end_offset" : 44,
"type" : "word",
"position" : 9
},
{
"token" : "!",
"start_offset" : 45,
"end_offset" : 46,
"type" : "word",
"position" : 10
},
{
"token" : "……我爱你",
"start_offset" : 47,
"end_offset" : 52,
"type" : "word",
"position" : 11
},
{
"token" : "我爱你",
"start_offset" : 49,
"end_offset" : 52,
"type" : "word",
"position" : 11
},
{
"token" : "才怪",
"start_offset" : 53,
"end_offset" : 55,
"type" : "word",
"position" : 12
},
{
"token" : "rainy",
"start_offset" : 56,
"end_offset" : 61,
"type" : "word",
"position" : 13
}
]
}
If we want something a bit stronger, we can borrow the pieces of the english analyzer and assemble an analyzer that both stems words and keeps # and @:
"analysis":{
"filter":{
"english_keywords":{
"keywords":[],
"type":"keyword_marker"
},
"english_stemmer":{
"type":"stemmer",
"language":"english"
},
"english_possessive_stemmer":{
"type":"stemmer",
"language":"possessive_english"
},
"english_stop":{
"type":"stop",
"stopwords": "_english_"
},
"hashtag_as_alphanum" : {
"type" : "word_delimiter_graph",
"type_table": ["# => ALPHANUM", "@ => ALPHANUM"]
}
},
"analyzer":{
"reb_english":{
"filter":[
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer",
"hashtag_as_alphanum"
],
"char_filter":[
],
"type":"custom",
"tokenizer":"whitespace"
}
},
"tokenizer":{
},
"char_filter":{
}
}