Elasticsearch：basic

Elasticsearch 入门笔记：Docker 环境、文档结构、mapping、常见 field type 和基本 CRUD。

发表于 2022/04/20

作者 puppylpg

28 分钟阅读

Elasticsearch：basic

边写边把之前es的书签清一清~

环境搭建
入门
索引
数据操作
并发修改 - 乐观锁
1. 模拟并发修改冲突

先把最容易混的层次摆出来：ES 的“索引”不是一个概念，而是一串概念链。文档写入 index，按 routing 落到 primary shard；每个 shard 本质上是一个 Lucene index，里面又由多个不可变 segment 组成。后面讲搜索、刷新、聚合、备份，基本都绕不开这条链。

flowchart TD
    D[document: 一条 JSON 数据] --> I[index: 文档集合]
    I --> S1[primary shard 0]
    I --> S2[primary shard 1]
    S1 --> L1[Lucene index]
    L1 --> G1[segment A]
    L1 --> G2[segment B]
    G1 --> INV[倒排索引: token -> doc]
    G1 --> DV[doc_values: doc -> value]

    style D fill:#e3f2fd,stroke:#4dabf7
    style INV fill:#fff3bf,stroke:#f59f00
    style DV fill:#e8f5e9,stroke:#51cf66

环境搭建

先用docker搭建一套环境，一套可以实操的环境，是后面学习的基础。

以7.12.1的es和kibana为例。二者要版本一致。

参考文档：

Kibana 7.12 Docker 文档

之所以选7.12，因为8.0+好像默认有密码认证，所以为了省事儿直接选了7.12。

创建kibana和es沟通的网络：

docker network create elastic

但是在Windows上使用docker启动es container的时候无比不顺利！一启动WSL就崩了……最终从限制es使用内存入手，解决了问题。限制内存之后，大概有两个报错：

  
bootstrap check failure [1] of [2]: initial heap size [67108864] not equal to maximum heap size [536870912]; this can cause resize pauses
bootstrap check failure [2] of [2]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

被Windows docker折腾的身心俱疲，已经懒得去管了……直接用两种方法解决：

还是linux好……想念Arch的第N天……想念Debian的第N/2天……

启动es，主要是设置jvm内存启动项，且必须让xms和xmx一致以解决上面的第一个问题；vm.max_map_count另行调大：

pull docker.elastic.co/elasticsearch/elasticsearch:7.12.1
docker run --name es01-test --net elastic -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e ES_JAVA_OPTS="-Xms1024m -Xmx1024m" docker.elastic.co/elasticsearch/elasticsearch:7.12.1

启动kibana：

docker pull docker.elastic.co/kibana/kibana:7.12.1
docker run --name kib01-test --net elastic -p 5601:5601 -e "ELASTICSEARCH_HOSTS=http://es01-test:9200" docker.elastic.co/kibana/kibana:7.12.1

然后就可以在kibana里操作es进行演示了：

本地 Kibana Dev Tools

入门

基本概念

协议：通过http请求增删改查数据，所以能做到语言无关。当然也可以用curl；
文档document：一条数据记录，json格式；
索引index：索引的意思在es里非常灵活，需要根据语境判断：
- 代表存储文档的数据库；
- 如果索引做动词，代表往index里索引一条document；
- 还可以指为文档的一个field创建的索引，使得这个field可搜索。为所有field创建索引是es的默认行为；

“索引”语境	含义	例子
名词 index	存文档的集合	`PUT /users`
动词 index	把文档写入 ES	`PUT /users/_doc/1`
field index	为字段建立可检索结构	`index: true`

各种“索引”可以参考官方员工文档示例。

文档结构

一个文档就是一条数据，它的结构大致可分成两部分：

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-fields.html

metadata

除了显式的由用户定义的存储数据的field，每个文档都有一些metadata：

_index：属于哪个索引；
_id：文档id。如果用户不指定，es就自动生成；
_source：由用户定义的json格式的显式数据；
_routing：路由。文档会根据该字段的值，被hash到相应分片进行存储。默认值就是_id；

路由算法：

routing_factor = num_routing_shards / num_primary_shards
shard_num = (hash(_routing) % num_routing_shards) / routing_factor

官方 _routing 字段文档

flowchart LR
    R[_routing 默认等于 _id] --> H[hash(_routing)]
    H --> V[虚拟分片 num_routing_shards]
    V --> P[primary shard]
    P --> N[data node]

    style R fill:#e3f2fd,stroke:#4dabf7
    style V fill:#fff3bf,stroke:#f59f00
    style P fill:#e8f5e9,stroke:#51cf66

如果自定义了routing（_id和routing不一致），和_id相关的api必须带上routing参数，否则es默认把_id作为routing，导致路由到错误的分片，结果找不到相应_id的文档。比如GET <index>/_doc/<_id>、POST <index>/_update/<_id>，都必须带上?routing=xxx参数！

还有一些其他不重要的metadata。

这里的metadata是mapping的metadata，还有一些metadata比如_verison不定义在mapping里，所以这里没介绍。

顺便列出一堆高版本弃用的metadata：

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/breaking_50_mapping_changes.html#_literal__timestamp_literal_and_literal__ttl_literal
https://www.elastic.co/guide/en/elasticsearch/reference/2.0/mapping-timestamp-field.html
https://www.elastic.co/guide/en/elasticsearch/reference/6.4/mapping-all-field.html
https://www.elastic.co/guide/en/elasticsearch/reference/8.1/mapping-parent-field.html

_timestamp很早就弃用了，现在如果想用一个字段记录文档的最新更新时间，需要：

手动创建一个这样的字段；
创建一个script pipeline更新该字段；
将该pipeline设置为索引的default pipeline；

参考：

https://stackoverflow.com/questions/17136138/how-to-make-elasticsearch-add-the-timestamp-field-to-every-document-in-all-indic/66958236#66958236
https://stackoverflow.com/questions/68286853/elasticsearch-ingest-pipeline-create-and-update-timestamp-field
https://discuss.elastic.co/t/creation-and-update-timestamps-for-each-doc/93456

显式字段：`_source`

用户定义的显式字段放在_source里：

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html

这些数据的结构由下面的mapping定义。

非关系型数据

es使用json保存数据。这种结构很灵活，可以有层次性，可以任意添加key value。

es鼓励把数据放在一个文档里，以避免联表查询。即使这样存储会造成一些数据冗余。而不是像mysql一样拆分到不同的数据表中以减少数据冗余。

索引

mapping

Mapping 是文档的结构定义，相当于mysql的schema。但是不同于mysql严格的数据结构定义，mapping支持strict和dynamic两种类型。

mapping 有两层作用：一层是定义字段能不能写，另一层是定义字段写入后会生成什么索引结构。text 会进 analyzer 生成倒排索引，keyword 通常同时有倒排和 doc_values，object 会 flatten，nested 会拆成隐藏子文档。不要只把它看成字段类型列表，它其实决定了后续能不能搜、能不能排序、能不能聚合。

dynamic：虽然可以定义要存储文档的结构，但文档的field如果多于mapping定义，照样会存下来，来什么存什么；
- dynamic: true：新的field会自动加入mapping，相当于直接扩充了mapping的定义。默认行为；
- dynamic: false：新的field会存下来，但不会为该field创建索引，也不可以用该field进行搜索（因为没有它的索引）。但是这个文档被查出来时，是带有这个field的；
dynamic: strict or explicit：explicit mapping我觉得其实就是把dynamic设为strict，不允许数据有多于mapping的字段，否则会报错。和mysql行为一致；

dynamic mapping非常适合用于初期测试，因为使用起来非常方便。但如果用在生产环境中，将会导致数据非常凌乱。所以生产环境建议使用strict。

dynamic：https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic.html

虽然dynamic默认为true，但是对于嵌套field（type=object）来说，它的默认dynamic设置继承自父object，而不是默认的true：

https://www.elastic.co/guide/en/elasticsearch/reference/7.12/dynamic.html#dynamic-inner-objects

mapping type

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/removal-of-types.html

在es 7里，mapping不允许有多种类型。之前一个mapping是可以设置多种type的，每个type有自己的mapping。这其实就相当于把两种文档的field merge起来了，在存储的时候，却只会存其中的一种，会导致空间很稀疏。

es 7里，每个index只能存储一种类型，type默认都叫：_doc。

修改field

https://www.elastic.co/guide/en/elasticsearch/reference/current/explicit-mapping.html#update-mapping

es不允许修改field的类型。如果修改field需要重新创建一个index，然后把旧的index的数据reindex到新的里面。

es也不允许重命名field，这会导致已经索引的数据失效。碰到这种情况，es推荐设置alias：

https://www.elastic.co/guide/en/elasticsearch/reference/current/field-alias.html

es的文档是不可变的（具体后面再介绍TODO），一旦存储就不能修改。这也是field不能改变的原因：如果允许改变，之前的所有数据都失效了。

添加field

https://www.elastic.co/guide/en/elasticsearch/reference/current/explicit-mapping.html#add-field-mapping

添加field在es里是可以的，无需新建index。

为什么不能修改field却可以添加field？反正之前的文档也没有新增的field的数据，所以新增一个field并不会导致之前的文档索引失效。

field type

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

显式定义field的时候，需要指定类型。es定义了很多种类型，可以按需选用。

挑几个重要的说一下：

text

https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html

es里默认存储字符串的类型就是text。text会被分词器拆分为token，建立倒排索引。之后就可以按照token检索这个文档了。

关于text的倒排索引和analyzer：Elasticsearch：search

keyword

https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html

和text不同，keyword类型的字符串不被分词器拆分为token，所以不会给这个文档在这个field上建立一个包含其中所有token的倒排索引。

但这并不意味着类型为keyword的这个field不可被搜索：keyword是可以被搜索的，只不过需要用整个字符串去和它做完整匹配。这也意味着其实它其实也建立了倒排索引。

它只是不像text，被切分为了一个个token而已。

在es 5之前，字符串还是用string类型去表示，通过string的not_analyzed属性表示该字符串到底要不要被分词为token。在es 5的时候，string被拆分为了text和keyword：

https://stackoverflow.com/a/53121991/7676237
https://www.elastic.co/cn/blog/strings-are-dead-long-live-strings

keyword除了用来存储string，还可以用来存储只做term查询不做range查询的numeric类型数据，可以获得更好的查询性能。比如product id。查询的时候，可以使用string值，也可以使用numeric值。返回的结果里，_source里该字段返回的是存储时候的值，fields查询该字段返回的是string值。比如：

GET <index>/_search
{
  "query": {
    "terms": {
      "tags": [
        "20601"
      ]
    }
  },
  "_source": true,
  "docvalue_fields": ["tags", "country"],
  "fields": ["tags", "country"]
}

object & nested

object相关的类型：https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html#object-types
- object type: https://www.elastic.co/guide/en/elasticsearch/reference/current/object.html
- nested type：https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html

object类型用于存储有层级的数据，本质上它是把嵌套的field flatten了，全都变成了具有前缀的顶层属性。所以它其实不是在存储嵌套数据！

嵌套数据使用nested类型。它和object类型最大的区别是在用于array时：object无法维护field之间的对应关系，但是nested可以。

https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html#nested-arrays-flattening-objects

关于object、nested、join，会在Elasticsearch：关系型文档里做更详细的阐述。

带dot的名字

以前，es可以配置是否把点分的名字改成object形式，从es5开始，点分的名字一定会被转换为object形式。

比如"server.latency.max": 100，生成的mapping为：

  
{
  "properties": {
    "server": {
      "type": "object",
      "properties": {
        "latency": {
          "type": "object",
          "properties": {
            "max": {
              "type": "long"
            }
          }
        }
      }
    }
  }
}

但是有一种情况会失败：

  
{
  "properties": {
    "server.latency": {
      "type": "long"
    },
    "server": {
      "type": "string"
    }
  }
}

这种情况下，server既是object（server.latency），又是string（server），是不可能的。

date

https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html

es没有date，它存的要么是字符串时间，要么是数字代表epoch毫秒（可以通过配置修改为秒）。具体用哪种存储，看定义的哪种格式。

date支持两种格式：

自定义格式：使用jdk的DateTimeFormatter解释格式，比如yyyy-MM-dd
- https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html
内置格式

可以对一个字段设置多个格式，使用||分隔。es会对给定值一个一个试，符合其中一个格式就可以使用。

比如：

"strict_date_optional_time||epoch_millis"

定义该字段既接收毫秒，又接收字符串，且字符串必须是strict_date_optional_time所规定的格式（必须有年月日yyyy-MM-dd，可以有时分秒毫秒等yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ）：

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html#strict-date-time

这个字符串的格式使用的是ISO 8601（和druid一样）：

https://en.wikipedia.org/wiki/ISO_8601

定义完格式后，只能写入这两种格式，es用该格式存储时间。

使用的时候，如果不指定格式，数据必须符合定义的格式，es才能进行比较。如果指定格式，es会自动做格式转换，然后再比较日期。

比如存储milli：

  
        "timestamp" : {
          "type" : "date",
          "format" : "epoch_millis"
        },

使用数字查询是可以的：

GET <index>/_count
{
  "query": {
    "range": {
      "timestamp": {
        "gte": 100
      }
    }
  }
}

使用其他格式是不行的：

GET <index>/_count
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "2020-02-02"
      }
    }
  }
}

报错：failed to parse date field [2020-02-02] with format [epoch_millis]，因为值不是milli格式的。

但是如果告诉es值的格式，es可以把值按照给定格式解释，转成milli，再和自己存储的值进行比较：

GET <index>/_count
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "2020-02-02",
        "format": "yyyy-MM-dd",
        "time_zone": "+08:00"
      }
    }
  }
}

这个是可以的。

同理，可以给时间做加减：

https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#date-math

加减的基准（anchor date）有两个：

now；
string，同样要给定格式，以供es翻译该stirng；

过去一小时的：

GET <index>/_search
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-1h",
        "time_zone": "+08:00"
      }
    }
  }
}

2022.09.16前两秒内（不含）的：

  
GET witake_media/_search
{
  "query": {
    "range": {
      "timestamp": {
        "gt": "2022.09.16||-2s",
        "lt": "2022.09.16", 
        "format": "yyyy.MM.dd", 
        "time_zone": "+08:00"
      }
    }
  }
}

或者：

GET <index>/_search
{
  "query": {
    "range": {
      "timestamp": {
        "gt": "2022-09-16T00:00:00||-2s",
        "lt": "2022-09-16T00:00:00", 
        "format": "yyyy-MM-dd'T'HH:mm:ss", 
        "time_zone": "+08:00"
      }
    }
  }
}

而这种横杆的格式，就是上面说的strict_date_optional_time，说以可以直接用这个内置格式：

  
GET witake_media/_search
{
  "query": {
    "range": {
      "timestamp": {
        "gt": "2022-09-16T00:00:00||-2s",
        "lt": "2022-09-16T00:00:00", 
        "format": "strict_date_optional_time", 
        "time_zone": "+08:00"
      }
    }
  }
}

时间加减还支持向下取整：

  
/d: Round down to the nearest day

获取昨天零点到今天零点的：

GET <index>/_count
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-1d/d",
        "lt": "now/d"
      }
    }
  }
}

这个对时间的处理方式还是挺教科书的！

array

es没有数组。任何一个field都可以存放0个或多个值，只要这多个值的类型一致。

In Elasticsearch, there is no dedicated array data type. Any field can contain zero or more values by default, however, all values in the array must be of the same data type.

所以在mapping上，无需定义array。

https://www.elastic.co/guide/en/elasticsearch/reference/current/array.html

multi field - 同一field索引多次

https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html

如果一个field要做多种处理，可以给它创建multi field。这一个field相当于变成了一组field，不同的field有不同的功能，可以按照需要，查询这组field里的某个field。

index template

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html

index主要由两部分组成：

mapping：定义数据结构；
settings：index设置。比如有几个副本replica、分成多少个分片存储、多久刷新一次；

除此之外，还有一些其他设置。

可以给index创建个默认模板，之后再创建index，主要定义mapping就行了。如果需要不同的设定，比如不同的副本、分片数，只需要显式定义一下，覆盖默认配置就行了。

数据操作

增 PUT/POST

就是往索引（名次）里索引（动词）数据。可以使用put或post，不同的方法格式不同。具体参考文档：

index api：https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html

PUT和POST的唯一区别就是put必须制定id，而post可以不指定id，让es帮忙生成：如果不指定文档id（_id），es将默认生成一个id。

在知道_id的情况下，PUT既可以新建又可以“修改文档”（暴力替换），非常方便！

批量操作`_bulk`

es还支持批量添加，能极大分摊每个文档的传输网络延迟：

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html

改 POST

es的文档是不可变的。所以所谓的update，其实就是先删掉旧文档，再增加新文档。这两个动作都是es内部完成的。

update只能在知道id的情况下使用：https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html
update by query则是更新查找到的文档，所以有一个搜索的动作：https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html

更新内容相同的文档

上面说了，index api的PUT操作既可以add，也可以暴力替换文档，即使内容一模一样：

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-noop

because the index API doesn’t fetch the old source and isn’t able to compare it against the new source.
当然也可以给index设置op_type=create，达到create if not exist的效果，但是如果exist，会报错。

但是update api在内容相同的情况下会返回noop，不做更新：

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#_detect_noop_updates

如果update接口不想检测noop：设置detect_noop为false。

查 GET

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html

查询是es的主要功能。默认es的每个field都是可查的。

es可支持的查询非常丰富：

term：完整匹配，常用于查keyword类型；
match：token匹配，用于查text类型；
range：范围匹配；
bool：查询条件组合；

等等。

参考：Elasticsearch：search

不止查询需要query：update by query，delete by query，也要用到query。

get request with body

es查询用的是get方法，但是却有body，用于指定一堆查询条件。看起来好像很离谱。但实际上，rfc并没有说get不能带body：

https://stackoverflow.com/questions/36939748/elasticsearch-get-request-with-request-body

删 DELETE/POST

delete：https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete.html
delete by query：https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html

不删索引，只删里面的所有数据：

POST /<index>/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

关闭索引 - 留着不查，删了可惜

_close。关闭就只放在磁盘上，不加载到内存里，不可查。只有它的metadata会放到内存里。_open之后就可查了。

并发修改 - 乐观锁

当多个请求并发修改一个文档，会引起并发问题。比如两个update操作。或者一个update，另一个delete。如果不加控制，就会冲突，比如经常提到的ABA问题。

这俩操作都不是原子操作，很可能修改的时候，文档版本已经不是query时候的版本了，说明这个过程中，文档已经被别的请求修改过了：

update by query
delete by query

es默认给文档加了版本，也就是乐观锁。每修改一次，版本+1，如果update by query（非原子操作）的时候发现版本变了，就是冲突了，需要放弃修改，重试。

_version在response里可以看到：

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#docs-index-api-response-body

es也支持修改的时候自己传入版本号（version_type=external&version=xxx），用于乐观锁的值。适用于数据源是另一个存储（比如mysql），且在那里已经显示用了乐观锁自己维护数据并发修改的情况。

delete by query也一样：可能会版本冲突，因为是先创建个要删除的索引的snapshot，再删除，这中间文档可能已经变了：

If a document changes between the time that the snapshot is taken and the delete operation is processed, it results in a version conflict and the delete operation fails.

模拟并发修改冲突

request 1先发送，后结束：

POST <index>/_update/5
{
    "script": "Thread.sleep(10000); ctx._source.xxx = 123"
}

request 2后发送，先结束：

POST <index>/_update/5
{
    "script": "ctx._source.yyy = 'zzz'"
}

二者都修改id=5的文档，此时第一个请求会失败。

update api可以设置retry_on_conflict参数，进行失败重试，默认为0：

(Optional, integer) Specify how many times should the operation be retried when a conflict occurs. Default: 0.

update by query api则是设置conflicts参数，如果conflicts=proceed，则会一直重试，直到成功，默认是abort：

(Optional, string) What to do if update by query hits version conflicts: abort or proceed. Defaults to abort.

You can opt to count version conflicts instead of halting and returning by setting conflicts to proceed.

返回值有version_conflicts，代表冲突的次数：

The number of version conflicts that the update by query hit.

elasticsearch

本文由作者按照 CC BY 4.0 进行授权