
Elasticsearch: analyzer

For Elasticsearch search, building an inverted index alone is not enough. For example, if a search should ignore case, a plain inverted index cannot do that by itself: the data has to be processed before it is indexed, e.g. lowercased. The component that performs this processing is the analyzer.

  1. analyzer
    1. char_filter
    2. tokenizer
      1. pattern tokenizer
      2. UAX URL email tokenizer
      3. path hierarchy tokenizer
    3. filter
      1. reverse token filter
    4. custom analyzer
    5. minimal components of an analyzer
  2. _analyze API
  3. search as you type
    1. n-gram tokenizer/filter - character-level sliding window
    2. edge n-gram tokenizer/filter - sliding window with its left edge fixed
      1. word prefixes
      2. what if you combine the tokenizer and the filter?
      3. truncate token filter - overlong search terms
    3. completion suggester - search as you type
    4. search analyzer - used together with edge n-gram
    5. shingle - token-level sliding window
  4. example: custom analyzers
    1. keeping #/@

analyzer

To process text before the inverted index is built, we use an analyzer to define how the text is handled. An analyzer consists of three parts:

  1. char_filter: character filters. Before the string is tokenized, its characters are filtered first, e.g. replacing & with and, stripping HTML tags, and so on.
  2. tokenizer: the tokenizer. It processes the string and splits it into tokens.
  3. filter: token filters (calling them token filter would feel more fitting; presumably es figures it is obviously operating on tokens, so the word token can be dropped). They modify tokens, and can also add or remove them, e.g. lowercasing tokens, reducing them to stems, adding synonyms.

es ships with a number of built-in analyzers.

For example the commonly used standard analyzer, language-specific ones such as the english analyzer, and so on. You can also add third-party analyzers, e.g. the ik analyzer for Chinese word segmentation.

Next, using the standard analyzer and the english analyzer as examples, let's look at char_filter/tokenizer/filter one by one:

char_filter

es has only a few built-in char filters, and they are not used all that often:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-charfilters.html

For example, the HTML Strip Character Filter:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html
  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/char-filters.html

Neither the standard analyzer nor the english analyzer uses a char filter.
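
As a quick sketch, the html_strip char filter can be tried directly through the _analyze API (the HTML snippet below is made up):

GET _analyze
{
  "tokenizer": "standard",
  "char_filter": ["html_strip"],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

The tags are stripped and the entity decoded before tokenization, so the tokens come out roughly as [ I'm, so, happy ].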

tokenizer

The built-in tokenizers:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

Both the standard analyzer and the english analyzer use the standard tokenizer, which splits English text on word boundaries (roughly, whitespace). Chinese, however, is split into one token per character, losing the semantics:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-tokenizer.html

es provides the _analyze API as a very convenient way to test an analyzer and its components:

POST _analyze
{
  "tokenizer": "standard",
  "text": "hello123 456 world 你好吗"
}

Output:

{
  "tokens" : [
    {
      "token" : "hello123",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "456",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "world",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "你",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "好",
      "start_offset" : 20,
      "end_offset" : 21,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "吗",
      "start_offset" : 21,
      "end_offset" : 22,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    }
  ]
}

The standard tokenizer actually does quite a lot: it processes text according to the Unicode text segmentation algorithm defined in Unicode Standard Annex #29, which includes dropping punctuation and the like. But it does not handle languages such as Chinese well; if you need to respect the semantics of those languages, use a more specialized tokenizer such as icu or ik.

Also note that each emitted token carries not just the token text itself but position and type information as well; match_phrase queries make use of the position.

  • max_token_length: defaults to 255. The standard tokenizer limits token length; anything longer than 255 characters is simply split at the 255-character mark.

Note: what we usually call icu and ik are really tokenizers. Both also ship analyzers built on those tokenizers, e.g. ik_smart.

  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/icu-tokenizer.html

pattern tokenizer

Tokenizes using a pattern you specify yourself:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html
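
A minimal sketch (pattern and text are made up), splitting on commas instead of the default non-word pattern:

GET _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": ","
  },
  "text": "comma,separated,values"
}

This should yield [ comma, separated, values ].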

UAX URL email tokenizer

Keeps emails and URLs intact as single tokens:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-uaxurlemail-tokenizer.html

Note that the default maximum token length is 255, which a URL can easily exceed.
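
A small sketch of the difference (the address is made up): the standard tokenizer would break the address apart, while uax_url_email keeps it whole:

GET _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com"
}

Expected tokens: [ Email, me, at, john.smith@global-international.com ].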

path hierarchy tokenizer

A path and all of its parent directories become tokens:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html
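
A minimal sketch with a made-up path:

GET _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/one/two/three"
}

Expected tokens: [ /one, /one/two, /one/two/three ].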

filter

There are many more token filters, since there are plenty of things you might want to do to tokens:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

For example, the lowercase filter:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lowercase-tokenfilter.html

Test it with the _analyze API, specifying both a tokenizer and a filter:

GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "THE Quick FoX JUMPs"
}

Output:

[ the, quick, fox, jumps ]

The stop word filter removes a language's stop words, i.e. low-information words such as a/an/the:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html

Like tokenizers, filters also have properties that can be customized (a configured example follows the link below):

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html#analysis-stop-tokenfilter-configure-parms
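
For instance (the stop word list and text here are made up), an inline stop filter with a custom word list and ignore_case:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stop",
      "ignore_case": true,
      "stopwords": ["and", "is", "the"]
    }
  ],
  "text": "The fox is quick and the dog is lazy"
}

Only [ fox, quick, dog, lazy ] should survive.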

The standard analyzer uses the lowercase filter, while the english analyzer uses quite a few filters:

PUT /english_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_" 
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["example"] 
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}

The first is a stop word filter: when you don't want certain tokens, you can filter them out. The stop word set configured here is es's built-in English stop word list, _english_:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html#english-stop-words

The last two filters are stemmer filters, which reduce tokens to their stem form and greatly broaden what can match:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html

The second is a keyword_marker filter, used when you do not want certain words to be stemmed:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-marker-tokenfilter.html
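
A sketch of how the two interact (the protected word list and text are made up); keyword_marker has to come before the stemmer so the protected word is left alone:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keyword_marker",
      "keywords": ["skiing"]
    },
    {
      "type": "stemmer",
      "language": "english"
    }
  ],
  "text": "skiing running jumps"
}

The output should be roughly [ skiing, run, jump ]: skiing is protected, the other words get stemmed.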

reverse token filter

Reverses each token:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html

It looks useless at first, but if you need to match *bar, it is far faster to reverse the tokens and match rab* instead: the leading wildcard goes away and performance improves dramatically.
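
A tiny sketch of what the filter does to each token (text made up):

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["reverse"],
  "text": "quick foobar"
}

Expected tokens: [ kciuq, raboof ]. Index the reversed form, reverse the wildcard query as well, and *bar becomes the prefix query rab*.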

It also enables a neat trick: reverse + edge_ngram + reverse:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["reverse", "edge_ngram", "reverse"], 
  "text" : "Hello!"
}

This generates edge n-grams from the end of the token — the same effect as setting side=back on edge_ngram.

custom analyzer

If none of the built-in analyzers meets your needs, you can define your own custom analyzer.

Changing any setting of a built-in analyzer already amounts to defining a custom analyzer. In most cases a custom analyzer really just means:

  1. tweaking the configuration of an existing char_filter/tokenizer/filter to produce a new char_filter/tokenizer/filter;
  2. assembling one or more char_filter/tokenizer/filter together.

The rebuilt english analyzer shown above is exactly that: tokenizer and filter settings were tweaked and then assembled together.

For example, a custom standard analyzer whose maximum token length is 5:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

You can also define a custom char filter that converts & to and:

"char_filter": {
    "&_to_and": {
        "type":       "mapping",
        "mappings": [ "&=> and "]
    }
}

This is really just the mapping char_filter with its configuration tweaked.

  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/custom-analyzers.html

Finally, put these char_filter/tokenizer/filter into an analyzer of type custom to assemble your own analyzer:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html
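
Putting the pieces together, a sketch of a complete custom analyzer (index and analyzer names are made up) that combines the &_to_and char filter above with the standard tokenizer and the lowercase filter:

PUT my-custom-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [ "&=> and " ]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "&_to_and" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}

Analyzing "Tom & Jerry" with my_custom_analyzer should then yield [ tom, and, jerry ].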

minimal components of an analyzer

Although an analyzer is made of three kinds of components, it may have any number of char_filters and filters (zero or more), but it must have exactly one tokenizer.

The most typical example is the whitespace analyzer: it consists of nothing but a single tokenizer, the whitespace tokenizer, which splits on whitespace.
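
A quick check of that claim (text made up):

GET _analyze
{
  "analyzer": "whitespace",
  "text": "The QUICK brown-fox."
}

Expected tokens: [ The, QUICK, brown-fox. ]. Nothing is lowercased and no punctuation is removed, because there is no filter at all.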

_analyze API

The _analyze API tests analyzers, and can also test char_filter, tokenizer, and filter combinations:

GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "this is a <b>test</b>"
}

Or test an inline custom truncate filter:

GET _analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "truncate",
      "length": 7
    },
    "lowercase"
  ],
  "text": [
    "Hello World"
  ]
}

Returns:

{
  "tokens" : [
    {
      "token" : "hello w",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    }
  ]
}

A handy feature is referencing the analyzer of a field in an existing index:

GET /<index>/_analyze
{
  "field": "description", 
  "text": ["hello world"]
}

Or specify an analyzer by name directly; both index analyzers and search analyzers work:

GET <index>/_analyze
{
  "analyzer": "autocomplete_sentence_search",
  "text": ["hello world"]
}

search as you type

Some tokenizers/filters are powerful enough to deserve a section of their own. They are commonly used to implement search as you type.

n-gram tokenizer/filter - character-level sliding window

A token filter can modify tokens, and it can also add or remove them. The stop word filter is an example of removing tokens; n-gram is an example of adding them.

The n-gram tokenizer and the n-gram filter are best introduced together:

  • the n-gram tokenizer tokenizes the text directly into n-grams;
  • the n-gram filter takes the tokens an existing tokenizer has produced and generates n-grams for each of them.

tokenizer:

POST _analyze
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}

Produces:

[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]

The token_chars property configures which character classes are kept in tokens. It defaults to an empty array, i.e. the whole text is kept as one token, effectively treating the entire string as a keyword. min_gram/max_gram default to 1/2.

A customized tokenizer:

GET _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 3
  },
  "text": "hello world"
}

This produces a whole pile of tokens:

{
  "tokens" : [
    {
      "token" : "hel",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "hell",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ell",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "ello",
      "start_offset" : 1,
      "end_offset" : 5,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "llo",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "llo ",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lo ",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "lo w",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "o w",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "o wo",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : " wo",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : " wor",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "wor",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "worl",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "word",
      "position" : 13
    },
    {
      "token" : "orl",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 14
    },
    {
      "token" : "orld",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 15
    },
    {
      "token" : "rld",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 16
    }
  ]
}

Because n-grams produce an explosive number of tokens, es requires by default that the difference between max_gram and min_gram not exceed 1; otherwise it reports an error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "The difference between max_gram and min_gram in NGram Tokenizer must be less than or equal to: [1] but was [5]. This limit can be set by changing the [index.max_ngram_diff] index level setting."
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "The difference between max_gram and min_gram in NGram Tokenizer must be less than or equal to: [1] but was [5]. This limit can be set by changing the [index.max_ngram_diff] index level setting."
  },
  "status" : 400
}

Even so, it still produces far too many tokens, and the starting positions of n-gram tokens are fairly meaningless, so plain n-grams are rarely used.
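
If you really do need a larger spread anyway, the error message above points at the index-level setting index.max_ngram_diff. A sketch of raising it at index creation time (index, tokenizer and analyzer names are made up):

PUT my-ngram-index
{
  "settings": {
    "index": {
      "max_ngram_diff": 5
    },
    "analysis": {
      "tokenizer": {
        "wide_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 7
        }
      },
      "analyzer": {
        "wide_ngram_analyzer": {
          "tokenizer": "wide_ngram"
        }
      }
    }
  }
}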

The n-gram filter works the same way, except that, being a filter, it must be paired with a tokenizer:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "ngram",
      "min_gram": 3,
      "max_gram": 4
    }
  ],
  "text": "hello world"
}

The resulting tokens are a bit better than with the n-gram tokenizer, because the standard tokenizer first splits the string into two tokens, so each token yields fewer n-grams:

{
  "tokens" : [
    {
      "token" : "hel",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "hell",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "ell",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "ello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "llo",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "wor",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "worl",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "orl",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "orld",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "rld",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

edge n-gram tokenizer/filter - sliding window with its left edge fixed

Compared with the n-gram tokenizer/filter, the edge n-gram tokenizer and edge n-gram filter fix the left end (the starting point) of the window at the edge of the token, so everything they produce is a prefix, which is much more useful.

Likewise, the edge n-gram tokenizer treats the whole string as one keyword by default and then, starting from the edge, splits it into a series of tokens:

GET _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "max_gram": 5
  },
  "text": "hello world"
}

Since it does not slide a window from every position the way n-gram does, it produces far fewer tokens, and therefore there is no restriction that max and min may differ by at most 1:

{
  "tokens" : [
    {
      "token" : "h",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "he",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hel",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "hell",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 4
    }
  ]
}

The edge n-gram filter, likewise, must be combined with a tokenizer:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 2,
      "max_gram": 4
    }
  ],
  "text": "hello world"
}

Output:

{
  "tokens" : [
    {
      "token" : "he",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "hel",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "hell",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "wo",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "wor",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "worl",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

The standard tokenizer splits the string into the two tokens hello and world, and the edge n-gram filter maps them to 6 tokens, each a 2-to-4-character prefix of one of the original tokens.

So as the user types, the search box only has to fire a request to es once more than two characters have been entered and display the results live beneath the box, and you get a search-as-you-type effect! (If the search can't come back faster than the user types, though, you're out of luck...)

Because this feature is so useful, es lets you declare a field with the "search_as_you_type" type directly in the mapping:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html

It works out of the box; you don't even need to define a custom analyzer containing an edge n-gram filter.

When the user types "he wor", both he and wor match, and the document is returned as a candidate.
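
A minimal sketch of that field type (index and field names are made up): the mapping declares the field, and queries typically use a bool_prefix multi_match over the generated sub-fields:

PUT my-sayt-index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "search_as_you_type"
      }
    }
  }
}

GET my-sayt-index/_search
{
  "query": {
    "multi_match": {
      "query": "he wor",
      "type": "bool_prefix",
      "fields": [
        "title",
        "title._2gram",
        "title._3gram"
      ]
    }
  }
}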

word prefixes

The edge n-gram tokenizer treats the whole string as a keyword by default, so on its own it can only produce prefixes of the entire sentence.

You can think of it as: edge n-gram tokenizer = keyword tokenizer + edge n-gram filter.

If you want prefixes of every word instead, first tokenize with the standard tokenizer and then apply the edge n-gram filter:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 2,
      "max_gram": 4
    }
  ],
  "text": "hello world"
}

Result:

{
  "tokens" : [
    {
      "token" : "he",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "hel",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "hell",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "wo",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "wor",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "worl",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

Alternatively, configure the edge n-gram tokenizer itself (via token_chars) to treat words as the unit it n-grams, and skip the filter entirely:

GET _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "max_gram": 5,
    "token_chars": ["letter"]
  },
  "text": "hello world"
}

Result:

{
  "tokens" : [
    {
      "token" : "he",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "hel",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hell",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "wo",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "wor",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "worl",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "word",
      "position" : 5
    }
  ]
}

Both approaches produce the same tokens, though the offset information differs, which makes sense since one is a tokenizer and the other a filter.

what if you combine the tokenizer and the filter?

The n-gram filter and edge n-gram filter must be paired with a tokenizer (an analyzer must have exactly one), so can that tokenizer itself be an n-gram or edge n-gram tokenizer? Of course it can:

GET _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 4
  },
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 2,
      "max_gram": 4
    }
  ],
  "text": "hello world"
}

Not that doing so is particularly meaningful.

truncate token filter - overlong search terms

The custom filter above only generates prefixes of length 2-4 as tokens. If the user actually types hello or world, it no longer matches any of those prefixes, which from the user's point of view looks absurd.

The root cause is that the query term is longer than the edge n-gram's max_gram. In that case you can add a truncate token filter to the analyzer, which automatically truncates the search term to max_gram characters, and the matches come back:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenfilter.html#analysis-edgengram-tokenfilter-max-gram-limits
  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-truncate-tokenfilter.html

So for search as you type, user input should be analyzed as keyword + truncation, while the stored data should be analyzed with edge n-gram:

  "analysis":{
    "filter":{
      "sentence_char_trunc": {
        "type": "truncate",
        "length": 20
      }
    },
    "analyzer":{
      "autocomplete_sentence":{
        "tokenizer":"sentence_edge_ngram",
        "filter":[
          "lowercase"
        ]
      },
      "autocomplete_sentence_search":{
        "tokenizer":"keyword",
        "filter":[
          "lowercase",
          "sentence_char_trunc"
        ]
      }
    },
    "tokenizer":{
      "sentence_edge_ngram":{
        "type":"edge_ngram",
        "min_gram":2,
        "max_gram":20,
        "token_chars":[

        ]
      }
    },
    "char_filter":{
    }
  }

completion suggester - search as you type

If you need to match documents by prefixes in arbitrary order, use edge n-gram; if the text has a widely known order, use the completion suggester:

When you need search-as-you-type for text which has a widely known order, such as movie or song titles, the completion suggester is a much more efficient choice than edge N-grams. Edge N-grams have the advantage when trying to autocomplete words that can appear in any order.

But what exactly counts as a "widely known order"? TODO

It also appears to be a field type, just as over-the-top as "search_as_you_type"...

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html#completion-suggester

One big advantage is that it is heavily optimized for speed, to the point of being held in memory:

Ideally, auto-complete functionality should be as fast as a user types to provide instant feedback relevant to what a user has already typed in. Hence, completion suggester is optimized for speed. The suggester uses data structures that enable fast lookups, but are costly to build and are stored in-memory.
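
A minimal sketch of how it is used (index, field and suggestion names are made up): the field is mapped as type completion, and queries go through the suggest section with a prefix:

PUT songs
{
  "mappings": {
    "properties": {
      "title_suggest": {
        "type": "completion"
      }
    }
  }
}

PUT songs/_doc/1
{
  "title_suggest": "Hotel California"
}

GET songs/_search
{
  "suggest": {
    "song-suggest": {
      "prefix": "hote",
      "completion": {
        "field": "title_suggest"
      }
    }
  }
}

Typing "hote" should bring back "Hotel California" as a suggestion.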

search analyzer - used together with edge n-gram

Normally, the analyzer applied to the search terms at query time should be the same as the analyzer used at index time, so that the query tokens and the indexed tokens are matched under the same rules. For example, if you index with the english analyzer, you search with the english analyzer too.

edge n-gram is different: at index time the analyzer is edge-n-gram based and what gets stored are prefixes. At search time, however, you must not apply that analyzer to the query as well; otherwise any document whose indexed prefixes share a prefix with the query would match.

For example, the indexed term pikachu is stored in the inverted index as the three prefixes pi/pik/pika. The query pipi should, in theory, not match pikachu. But if edge n-gram is applied to the query too, the query also becomes three terms: pi/pip/pipi, and the query term pi matches the indexed pi. That is not at all what we want.

analyzer and search_analyzer are both field-level properties set in the mapping:

  • the former is called the index analyzer;
  • the latter is called the search analyzer.

Define a field with several sub-fields (multi-fields):

      "name":{
        "type":"keyword",
        "fields":{
          "standard":{
            "type":"text",
            "analyzer":"standard"
          },
          "autocomplete":{
            "type":"text",
            "analyzer":"autocomplete_sentence",
            "search_analyzer":"autocomplete_sentence_search"
          }
        }
      }

The name.autocomplete field uses the custom autocomplete_sentence analyzer as its index analyzer and the custom autocomplete_sentence_search as its search analyzer.

The former uses edge n-gram; the latter is just the keyword tokenizer:

        "analyzer":{
          "autocomplete_sentence":{
            "tokenizer":"sentence_edge_ngram",
            "filter":[
              "lowercase"
            ]
          },
          "autocomplete_sentence_search":{
            "tokenizer":"keyword",
            "filter":[
              "lowercase"
            ]
          }
        },
        "tokenizer":{
          "sentence_edge_ngram":{
            "type":"edge_ngram",
            "min_gram":2,
            "max_gram":20,
            "token_chars":[
              
            ]
          }
        }
      }

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html

shingle - token-level sliding window

A shingle is an n-gram at the token level: the edge n-gram token filter builds character n-grams (prefixes) within a single token, while shingle builds n-grams over multiple tokens.

For example:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "shingle",
      "min_shingle_size": 3,
      "max_shingle_size": 4
    }
  ],
  "text": "To beyond and halo infinite"
}

This generates 5 individual words and 5 shingles:

{
  "tokens" : [
    {
      "token" : "To",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "To beyond and",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "To beyond and halo",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "shingle",
      "position" : 0,
      "positionLength" : 4
    },
    {
      "token" : "beyond",
      "start_offset" : 3,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "beyond and halo",
      "start_offset" : 3,
      "end_offset" : 18,
      "type" : "shingle",
      "position" : 1,
      "positionLength" : 3
    },
    {
      "token" : "beyond and halo infinite",
      "start_offset" : 3,
      "end_offset" : 27,
      "type" : "shingle",
      "position" : 1,
      "positionLength" : 4
    },
    {
      "token" : "and",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "and halo infinite",
      "start_offset" : 10,
      "end_offset" : 27,
      "type" : "shingle",
      "position" : 2,
      "positionLength" : 3
    },
    {
      "token" : "halo",
      "start_offset" : 14,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "infinite",
      "start_offset" : 19,
      "end_offset" : 27,
      "type" : "word",
      "position" : 4
    }
  ]
}

min_shingle_size=3, yet single words still appear, because of the output_unigrams property, which defaults to true and emits the original single-word tokens as well.

For a field indexed with shingles, you can simply use the same analyzer at search time; unlike edge n-gram there is no need for a separate search analyzer, because shingles match whole words, which matches how we expect search to behave:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html

Defining analyzers that include a shingle filter:

# analyzer
PUT /puzzle
{
  "settings":{
    "number_of_shards":1,
    "number_of_replicas":1,
    "analysis":{
      "char_filter":{
        "&_to_and":{
          "type":"mapping",
          "mappings":[
            "&=> and "
          ]
        }
      },
      "filter":{
        "english_stop":{
          "type":"stop",
          "ignore_case":true,
          "stopwords":[
            "a",
            "an",
            "are",
            "as",
            "at",
            "be",
            "but",
            "by",
            "for",
            "if",
            "in",
            "into",
            "is",
            "it",
            "no",
            "not",
            "of",
            "on",
            "or",
            "such",
            "that",
            "the",
            "their",
            "then",
            "there",
            "these",
            "they",
            "this",
            "to",
            "was",
            "will",
            "with"
          ]
        },
        "english_keywords":{
          "type":"keyword_marker",
          "keywords":[
            "example"
          ]
        },
        "english_stemmer":{
          "type":"stemmer",
          "language":"english"
        },
        "english_possessive_stemmer":{
          "type":"stemmer",
          "language":"possessive_english"
        },
        "my_shingle_filter":{
          "type":"shingle",
          "min_shingle_size":2,
          "max_shingle_size":2,
          "output_unigrams":false
        }
      },
      "analyzer":{
        "reb_standard":{
          "type":"custom",
          "char_filter":[
            "&_to_and"
          ],
          "tokenizer":"standard"
        },
        "reb_english":{
          "type":"custom",
          "char_filter":[
            "&_to_and"
          ],
          "tokenizer":"standard",
          "filter":[
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        },
        "my_shingle_analyzer":{
          "type":"custom",
          "char_filter":[
            "&_to_and"
          ],
          "tokenizer":"standard",
          "filter":[
            "lowercase",
            "my_shingle_filter"
          ]
        },
        "eng_shingle_analyzer":{
          "type":"custom",
          "char_filter":[
            "&_to_and"
          ],
          "tokenizer":"standard",
          "filter":[
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer",
            "my_shingle_filter"
          ]
        }
      }
    }
  }
}

Defining the mapping:

PUT /puzzle/_mapping
{
  "properties": {
    "article": {
      "type": "keyword",
      "fields": {
        "stan": {
          "type": "text",
          "analyzer": "standard"
        },
        "eng": {
          "type": "text",
          "analyzer": "english"
        },
        "reb_eng": {
          "type": "text",
          "analyzer": "reb_english"
        },
        "reb_stan": {
          "type": "text",
          "analyzer": "reb_standard"
        },
        "icu": {
          "type": "text",
          "analyzer": "icu_analyzer"
        },
        "ik": {
          "type": "text",
          "analyzer": "ik_smart"
        }
      }
    }
  }
}
 
PUT /puzzle/_mapping
{
  "properties": {
    "article": {
      "type": "keyword",
      "fields": {
        "shingle": {
          "type": "text",
          "analyzer": "eng_shingle_analyzer"
        }
      }
    }
  }
}

Searching:

GET /puzzle/_search
{
  "query": {
      "bool": {
         "must": {
            "match": {
               "article.reb_eng": {
                  "query": "puzzles & survival",
                  "minimum_should_match": "100%"
               }
            }
         },
         "should": [
           {
              "match": {
                 "article.shingle": "puzzles & survival"
              }
           }
         ]
      }
   }
}

Shingles sit between match and match_phrase: like match, they do not require every search term to appear (match_phrase can tune slop, but every term must still be present), yet unlike match they are not completely order-blind (shingles capture local word order, so documents whose local word order matches the shingles score higher):

  • https://www.elastic.co/guide/cn/elasticsearch/guide/current/shingles.html
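
For comparison, a match_phrase query with slop against the same article.eng field defined above might look like this; unlike the shingle should-clause, it insists that every term be present:

GET /puzzle/_search
{
  "query": {
    "match_phrase": {
      "article.eng": {
        "query": "puzzles survival",
        "slop": 2
      }
    }
  }
}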

example: custom analyzers

Some typical examples of custom analyzers.

keeping #/@

Requirement: be able to search for #hello.

By default es strips punctuation, so searching for #hello is the same as searching for hello:

GET /_analyze
{
  "tokenizer": "standard",
  "text": ["hello #world @wtf &emmm ???  ? . , !tanhao ! ! ……我爱你 才怪"]
}

Result:

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "wtf",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "emmm",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "tanhao",
      "start_offset" : 36,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "我",
      "start_offset" : 49,
      "end_offset" : 50,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "爱",
      "start_offset" : 50,
      "end_offset" : 51,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "你",
      "start_offset" : 51,
      "end_offset" : 52,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "才",
      "start_offset" : 53,
      "end_offset" : 54,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "怪",
      "start_offset" : 54,
      "end_offset" : 55,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    }
  ]
}

The component responsible is the standard tokenizer inside the standard analyzer: following the Unicode Text Segmentation algorithm, it drops punctuation already at tokenization time.

So it is not only filters that remove things; a tokenizer can also drop characters while producing tokens.

If we switch to the whitespace tokenizer, it simply splits on spaces without detecting word boundaries, and therefore without removing punctuation either:

GET /_analyze
{
  "tokenizer": "whitespace",
  "text": ["hello #world @wtf &emmm ???  ? . , !tanhao ! ! ……我爱你 才怪"]
}

Result:

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "#world",
      "start_offset" : 6,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "@wtf",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "&emmm",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "???",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "?",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : ".",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : ",",
      "start_offset" : 33,
      "end_offset" : 34,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "!tanhao",
      "start_offset" : 35,
      "end_offset" : 42,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "!",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "!",
      "start_offset" : 45,
      "end_offset" : 46,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "……我爱你",
      "start_offset" : 47,
      "end_offset" : 52,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "才怪",
      "start_offset" : 53,
      "end_offset" : 55,
      "type" : "word",
      "position" : 12
    }
  ]
}

Note that the type it reports is word rather than the more specific ALPHANUM and friends.

According to the article referenced, if you want to keep symbols such as #/@, you need the word delimiter graph token filter and have it preserve those characters:

"filter":{
  "hashtag_as_alphanum" : {
    "type" : "word_delimiter_graph",
    "type_table": ["# => ALPHANUM", "@ => ALPHANUM"]
  }
}

But filters run after the tokenizer, so the tokenizer has to be switched to whitespace; otherwise the tokenizer would already have stripped the punctuation, leaving the filter nothing to preserve.

How word_delimiter_graph works:

The whitespace tokenizer only splits on spaces, which leaves lots of punctuation-laden tokens; word_delimiter_graph then strips the punctuation surrounding each token:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": ["word_delimiter_graph"],
  "text": ["hello #world @wtf &emmm ???  ? . , !tanhao ! ! ……我爱你 才怪 rainy"]
}

The resulting tokens are now close to what the standard tokenizer produces:

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "wtf",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "emmm",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "tanhao",
      "start_offset" : 36,
      "end_offset" : 42,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "我爱你",
      "start_offset" : 49,
      "end_offset" : 52,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "才怪",
      "start_offset" : 53,
      "end_offset" : 55,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "rainy",
      "start_offset" : 56,
      "end_offset" : 61,
      "type" : "word",
      "position" : 13
    }
  ]
}

Now, if we tell it not to strip # and @, tokens like #hello can be preserved.

The new analyzer looks like this:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "type_table": [
        "# => ALPHANUM",
        "@ => ALPHANUM"
      ]
    }
  ],
  "text": ["hello #world @wtf &emmm# ???  ? . , !tanhao ! ! ……我爱你 才怪 rainy"]
}

Result:

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "#world",
      "start_offset" : 6,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "@wtf",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "emmm#",
      "start_offset" : 19,
      "end_offset" : 24,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "tanhao",
      "start_offset" : 37,
      "end_offset" : 43,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "我爱你",
      "start_offset" : 50,
      "end_offset" : 53,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "才怪",
      "start_offset" : 54,
      "end_offset" : 56,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "rainy",
      "start_offset" : 57,
      "end_offset" : 62,
      "type" : "word",
      "position" : 13
    }
  ]
}

The token #world is successfully preserved.

Of course, emmm# is preserved as well.

If we set "preserve_original": true:

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "type_table": [
        "# => ALPHANUM",
        "@ => ALPHANUM"
      ],
      "preserve_original": true
    }
  ],
  "text": ["hello #world @wtf &emmm ???  ? . , !tanhao ! ! ……我爱你 才怪 rainy"]
}

then both the original token with its punctuation and the filtered token without it are kept:

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "#world",
      "start_offset" : 6,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "@wtf",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "&emmm",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "emmm",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "???",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "?",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : ".",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : ",",
      "start_offset" : 33,
      "end_offset" : 34,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "!tanhao",
      "start_offset" : 35,
      "end_offset" : 42,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "tanhao",
      "start_offset" : 36,
      "end_offset" : 42,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "!",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "!",
      "start_offset" : 45,
      "end_offset" : 46,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "……我爱你",
      "start_offset" : 47,
      "end_offset" : 52,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "我爱你",
      "start_offset" : 49,
      "end_offset" : 52,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "才怪",
      "start_offset" : 53,
      "end_offset" : 55,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "rainy",
      "start_offset" : 56,
      "end_offset" : 61,
      "type" : "word",
      "position" : 13
    }
  ]
}

To make it more powerful, we can borrow the components of the english analyzer and assemble an analyzer that both stems words and preserves # and @:

  "analysis":{
    "filter":{
      "english_keywords":{
        "keywords":[],
        "type":"keyword_marker"
      },
      "english_stemmer":{
        "type":"stemmer",
        "language":"english"
      },
      "english_possessive_stemmer":{
        "type":"stemmer",
        "language":"possessive_english"
      },
      "english_stop":{
        "type":"stop",
        "stopwords":  "_english_"
      },
      "hashtag_as_alphanum" : {
        "type" : "word_delimiter_graph",
        "type_table": ["# => ALPHANUM", "@ => ALPHANUM"]
      }
    },
    "analyzer":{
      "reb_english":{
        "filter":[
          "english_possessive_stemmer",
          "lowercase",
          "english_stop",
          "english_keywords",
          "english_stemmer",
          "hashtag_as_alphanum"
        ],
        "char_filter":[
        ],
        "type":"custom",
        "tokenizer":"whitespace"
      }
    },
    "tokenizer":{
    },
    "char_filter":{
    }
  }

This post is licensed by the author under CC BY 4.0.