
Elasticsearch: backup

Elasticsearch may be a distributed system, but that won't stop human error! Write bad data or delete the wrong thing, and being distributed doesn't help at all. To keep years of work from being wiped out in one stroke, it's worth taking the time to set up backups. With backups in hand, you can experiment as freely as you like.

  1. hdfs plugin
  2. data dump
  3. backup to hdfs
    1. Configure the repo
      1. hadoop name node
      2. name node HA
    2. Inspect the repository
    3. Backup strategy
      1. Manual backup
      2. Automatic backup with SLM
    4. Backup progress
    5. retention: snapshot retention policy
    6. Incremental backups
  4. restore
    1. Restore a regular index
    2. Restore system indices
    3. Restore an entire cluster
    4. Restore to a different cluster

hdfs plugin

Elasticsearch supports backing up to many destinations; when the data volume is large, HDFS is a very good fit. And since HDFS is itself a distributed system, the backups are that much safer.

First, install the HDFS plugin for Elasticsearch:

  • https://www.elastic.co/guide/en/elasticsearch/plugins/8.4/repository-hdfs.html
[liuhaibo@th077.corp.yodao.com elasticsearch-7.12.0]$ bin/elasticsearch-plugin install repository-hdfs
warning: usage of JAVA_HOME is deprecated, use ES_JAVA_HOME
Future versions of Elasticsearch will require Java 11; your Java version from [/global/share/java/jdk1.8.0_66/jre] does not meet this requirement. Consider switching to a distribution of Elasticsearch with a bundled JDK. If you are already using a distribution with a bundled JDK, ensure the JAVA_HOME environment variable is not set.
-> Installing repository-hdfs
-> Downloading repository-hdfs from elastic
[=================================================] 100%   
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.lang.RuntimePermission accessClassInPackage.sun.security.krb5
* java.lang.RuntimePermission accessDeclaredMembers
* java.lang.RuntimePermission getClassLoader
* java.lang.RuntimePermission loadLibrary.jaas
* java.lang.RuntimePermission loadLibrary.jaas_nt
* java.lang.RuntimePermission loadLibrary.jaas_unix
* java.lang.RuntimePermission setContextClassLoader
* java.lang.reflect.ReflectPermission suppressAccessChecks
* java.net.SocketPermission * connect,resolve
* java.net.SocketPermission localhost:0 listen,resolve
* java.security.SecurityPermission insertProvider
* java.security.SecurityPermission putProviderProperty.SaslPlainServer
* java.util.PropertyPermission * read,write
* javax.security.auth.AuthPermission doAs
* javax.security.auth.AuthPermission getSubject
* javax.security.auth.AuthPermission modifyPrincipals
* javax.security.auth.AuthPermission modifyPrivateCredentials
* javax.security.auth.AuthPermission modifyPublicCredentials
* javax.security.auth.PrivateCredentialPermission javax.security.auth.kerberos.KerberosTicket * "*" read
* javax.security.auth.PrivateCredentialPermission javax.security.auth.kerberos.KeyTab * "*" read
* javax.security.auth.PrivateCredentialPermission org.apache.hadoop.security.Credentials * "*" read
* javax.security.auth.kerberos.ServicePermission * initiate
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
-> Installed repository-hdfs
-> Please restart Elasticsearch to activate any plugins installed
[liuhaibo@th077.corp.yodao.com elasticsearch-7.12.0]$ ls plugins/
repository-hdfs

data dump

To generate some data, we can copy a chunk over from an existing cluster with elasticdump:

[liuhaibo@th077.corp.yd.com ~]$ /global/share/esdump/elasticdump --input=https://[name]:[password]@[host]/witake_media --output=http://localhost:9200/witake_media --type=data --limit=1000

elasticdump can only be run on th077... so Elasticsearch has to be started on th077 as well... Exposing Elasticsearch on a public IP is quite a hassle, because VM settings (e.g. vm.max_map_count) have to be adjusted, which requires root. A good alternative is to expose Kibana instead, so Elasticsearch never needs a public IP.

With remote reindex you'd have to configure a whitelist and restart. elasticdump avoids those steps, but you should bump the batch size: the default of 100 is too slow, and 1000 works well.
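
For reference, the remote reindex whitelist is a static setting in elasticsearch.yml on the destination cluster (the hostname and port below are placeholders), which is exactly why it needs a restart:

# elasticsearch.yml -- must be set on every node, then restart
reindex.remote.whitelist: "some-remote-host:9200"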

With elasticdump there's no need to create the index ahead of time, which amounts to using dynamic mapping directly.
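
If you'd rather not rely on dynamic mapping, one option is to create the target index with an explicit mapping before running elasticdump. A minimal sketch (the fields here are hypothetical, not the real witake_media schema):

PUT witake_media
{
  "mappings": {
    "properties": {
      "title":        { "type": "text" },
      "publish_time": { "type": "date" }
    }
  }
}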

backup to hdfs

  1. Configure the repo;
  2. Back up:
    1. manually;
    2. periodically, via a configured backup policy (SLM).

Configure the repo

See the plugin's Configuration Properties.

hadoop name node

The docs don't explain how to configure nameservices, so start with the IP and port of the currently active NameNode:

PUT _snapshot/my_hdfs_repository
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://zj066:8000",
    "path": "/user/liuhaibo/overseas-elasticsearch-backup/my_hdfs_repository",
    "conf.dfs.client.read.shortcircuit": "true",
    "conf.dfs.domain.socket.path": "/var/run/hdfs-sockets/dn"
  }
}

When pointing at a NameNode, be careful not to use the standby NN; the standby doesn't support writes:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "repository_exception",
        "reason" : "[every5] cannot create blob store"
      }
    ],
    "type" : "repository_verification_exception",
    "reason" : "[every5] path  is not accessible on master node",
    "caused_by" : {
      "type" : "repository_exception",
      "reason" : "[every5] cannot create blob store",
      "caused_by" : {
        "type" : "unchecked_i_o_exception",
        "reason" : "Cannot create HDFS repository for uri [hdfs://zj065:8000]",
        "caused_by" : {
          "type" : "remote_exception",
          "reason" : "Operation category WRITE is not supported in state standby\n\tat org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)\n\tat org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1779)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1313)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3891)\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:984)\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)\n\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)\n"
        }
      }
    }
  },
  "status" : 500
}

name node HA

But using a raw NN address is unreliable: a failover can happen at any moment. We should really use the HDFS nameservice.

The official docs are out of date, though. Specifying conf_location in the API has no effect; the nameservice still isn't recognized:

PUT _snapshot/eadata
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://eadata-hdfs",
    "path": "/user/liuhaibo/overseas-elasticsearch-backup/eadata",
    "conf.dfs.client.read.shortcircuit": "true",
    "conf.dfs.domain.socket.path": "/var/run/hdfs-sockets/dn",
    "conf_location": "hdfs-site.xml,core-site.xml"
  }
}

Putting these settings in elasticsearch.yml is even worse: Elasticsearch doesn't recognize them at all and won't start.

What finally works is specifying every conf from hdfs-site.xml in the API, one by one. Be sure not to omit dfs.client.failover.proxy.provider.eadata-hdfs; without it, the nameservice cannot be resolved:

PUT _snapshot/eadata
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://eadata-hdfs/",
    "path": "/user/liuhaibo/overseas-elasticsearch-backup/eadata",
    "conf.dfs.nameservices": "eadata-hdfs",
    "conf.dfs.ha.namenodes.eadata-hdfs": "nn65,nn66",
    "conf.dfs.namenode.rpc-address.eadata-hdfs.nn65": "eadata-nn0:8000",
    "conf.dfs.namenode.rpc-address.eadata-hdfs.nn66": "eadata-nn1:8000",
    "conf.dfs.client.failover.proxy.provider.eadata-hdfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "conf.dfs.client.read.shortcircuit": "true",
    "conf.dfs.domain.socket.path": "/var/run/hdfs-sockets/dn"
  }
}

dfs.client.failover.proxy.provider: The prefix (plus a required nameservice ID) for the class name of the configured Failover proxy provider for the host. For normal HA mode, please consult the “Configuration Details” section of the HDFS High Availability documentation. For observer reading mode, please choose a custom class–ObserverReadProxyProvider.

  • hdfs-site.xml
  • core-site.xml

  • https://www.jianshu.com/p/7fcb258e2cfc

After that, HDFS can be used normally in HA mode!
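
To confirm that all nodes can actually reach the repository, the verify repository API can be run at any time; it reports the nodes on which verification succeeded:

POST _snapshot/eadata/_verify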

Inspect the repository

Check the backup repository we just configured:

GET _snapshot/eadata

The result:

{
  "eadata" : {
    "type" : "hdfs",
    "uuid" : "zQa6GAYyRiSHVqymRv5mXw",
    "settings" : {
      "path" : "/user/liuhaibo/overseas-elasticsearch-backup/eadata",
      "uri" : "hdfs://eadata-hdfs/",
      "conf" : {
        "dfs" : {
          "client" : {
            "failover" : {
              "proxy" : {
                "provider" : {
                  "eadata-hdfs" : "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
                }
              }
            },
            "read" : {
              "shortcircuit" : "true"
            }
          },
          "ha" : {
            "namenodes" : {
              "eadata-hdfs" : "nn65,nn66"
            }
          },
          "namenode" : {
            "rpc-address" : {
              "eadata-hdfs" : {
                "nn65" : "eadata-nn0:8000",
                "nn66" : "eadata-nn1:8000"
              }
            }
          },
          "domain" : {
            "socket" : {
              "path" : "/var/run/hdfs-sockets/dn"
            }
          },
          "nameservices" : "eadata-hdfs"
        }
      }
    }
  }
}

Backup strategy

With a repository in place, we can configure a backup strategy; automatic backups are recommended.

Manual backup

GET _snapshot
PUT _snapshot/my_hdfs_repository/manually_backup?wait_for_completion=true

GET _snapshot/my_hdfs_repository/_all
GET _snapshot/my_hdfs_repository/_current
GET _snapshot/my_hdfs_repository/manually_backup
GET _snapshot/_status
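
The create snapshot API also takes an optional body for finer control. A sketch that backs up a single index without the global cluster state (the snapshot name and metadata values here are made up):

PUT _snapshot/my_hdfs_repository/manually_backup_witake?wait_for_completion=true
{
  "indices": "witake_media",
  "include_global_state": false,
  "metadata": {
    "taken_by": "liuhaibo",
    "reason": "before a risky reindex"
  }
}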

Automatic backup with SLM

Define the automatic backup policy:

PUT _slm/policy/every15min-snapshots
{
  "schedule": "0 /15 * * * ?",       
  "name": "<ha-every15mim-snap-{now/m{yyyy-MM-dd_HH-mm|+08:00}}>", 
  "repository": "eadata",    
  "config": {
    "indices": "*",                 
    "include_global_state": true    
  },
  "retention": {                    
    "expire_after": "1d",
    "min_count": 5,
    "max_count": 100
  }
}

Elasticsearch cron schedules only support the UTC timezone: "All schedule times are in coordinated universal time (UTC); other timezones are not supported."
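
So a policy that should fire every day at 09:30 Beijing time (UTC+8), for example, has to be written against UTC:

"schedule": "0 30 1 * * ?"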

The relevant SLM retention attributes:

  • expire_after: snapshots older than this are deleted;
  • but at least min_count snapshots are always kept, even if expired;
  • and at most max_count are kept; beyond that, snapshots are deleted even if they haven't expired.

Snapshot names support date math, which makes it easy to embed the time in the name. But the name must be passed URL-encoded, for example:

# PUT <hello-{now/m{yyyy-MM-dd_HH-mm-SS|+08:00}}>
PUT %3Chello-%7Bnow%2Fm%7Byyyy-MM-dd_HH-mm-SS%7C%2B08%3A00%7D%7D%3E

GET hello*

The minimum schedule interval is 15 minutes; anything shorter is rejected:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "invalid schedule [0 /5 * * * ?]: schedule would be too frequent, executing more than every [15m]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "invalid schedule [0 /5 * * * ?]: schedule would be too frequent, executing more than every [15m]"
  },
  "status" : 400
}

An SLM policy can also be executed manually, right away:

POST _slm/policy/every15min-snapshots/_execute

View the SLM policies: GET _slm/policy

{
  "every15min-snapshots" : {
    "version" : 1,
    "modified_date_millis" : 1666001016145,
    "policy" : {
      "name" : "<ha-every15mim-snap-{now/m{yyyy-MM-dd_HH-mm|+08:00}}>",
      "schedule" : "0 /15 * * * ?",
      "repository" : "eadata",
      "config" : {
        "indices" : "*",
        "include_global_state" : true
      },
      "retention" : {
        "expire_after" : "1d",
        "min_count" : 5,
        "max_count" : 100
      }
    },
    "last_success" : {
      "snapshot_name" : "ha-every15mim-snap-2022-10-19_11-45-zjzcuoxttoogzakdakivsa",
      "time" : 1666151103243
    },
    "next_execution_millis" : 1666152000000,
    "stats" : {
      "policy" : "every15min-snapshots",
      "snapshots_taken" : 168,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 63,
      "snapshot_deletion_failures" : 0
    }
  },
  "every5min-snapshots" : {
    "version" : 1,
    "modified_date_millis" : 1665994323580,
    "policy" : {
      "name" : "<every15min-snap-{now/m{yyyy-MM-dd_HH-mm|+08:00}}>",
      "schedule" : "0 /15 * * * ?",
      "repository" : "every5",
      "config" : {
        "indices" : "*",
        "include_global_state" : true
      },
      "retention" : {
        "expire_after" : "30d",
        "min_count" : 5,
        "max_count" : 50
      }
    },
    "last_success" : {
      "snapshot_name" : "every15min-snap-2022-10-19_11-45-gdbkba2stywfzr0jmo8owq",
      "time" : 1666151103142
    },
    "next_execution_millis" : 1666152000000,
    "stats" : {
      "policy" : "every5min-snapshots",
      "snapshots_taken" : 175,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 96,
      "snapshot_deletion_failures" : 0
    }
  }
}

Inspect the contents of the snapshots taken so far: GET _snapshot/eadata/_all

{
  "snapshots" : [
    {
      "snapshot" : "ha-every15mim-snap-2022-10-17_19-15-ubh9fwcfsvkvged48wmo8a",
      "uuid" : "vZudgmjVTV6lWbXIxG7Dbg",
      "version_id" : 7120099,
      "version" : "7.12.0",
      "indices" : [
        ".apm-custom-link",
        ".ds-ilm-history-5-2022.10.17-000001",
        ".ds-.slm-history-5-2022.10.17-000001",
        "hello-2022-10-17-16-08-00",
        "witake_media",
        ".kibana-event-log-7.12.0-000001",
        "hello-2022-10-17_16-09-00",
        ".tasks",
        ".kibana_7.12.0_001",
        ".apm-agent-configuration",
        ".kibana_task_manager_7.12.0_001"
      ],
      "data_streams" : [
        "ilm-history-5",
        ".slm-history-5"
      ],
      "include_global_state" : true,
      "metadata" : {
        "policy" : "every15min-snapshots"
      },
      "state" : "SUCCESS",
      "start_time" : "2022-10-17T11:14:59.941Z",
      "start_time_in_millis" : 1666005299941,
      "end_time" : "2022-10-17T11:15:10.350Z",
      "end_time_in_millis" : 1666005310350,
      "duration_in_millis" : 10409,
      "failures" : [ ],
      "shards" : {
        "total" : 11,
        "failed" : 0,
        "successful" : 11
      },
      "feature_states" : [
        {
          "feature_name" : "kibana",
          "indices" : [
            ".kibana_7.12.0_001",
            ".kibana_task_manager_7.12.0_001",
            ".apm-agent-configuration",
            ".apm-custom-link"
          ]
        },
        {
          "feature_name" : "tasks",
          "indices" : [
            ".tasks"
          ]
        }
      ]
    },
    {
      "snapshot" : "ha-every15mim-snap-2022-10-17_19-30-gb-s9fzosmgxsh2v0tu_qq",
      "uuid" : "PvXUheF-QqSQ94Y08FzCkQ",
      ...
    },
    {
      "snapshot" : "ha-every15mim-snap-2022-10-17_20-00-kqqw3fahs_-ucsm8nsp3vw",
      "uuid" : "Z6qWAZQFTJ6d_21kvIR4iw",
      ...
    }
  ]
}

For every snapshot, all the indices and feature_states it backed up are visible.

Backup progress

View the snapshot currently being taken: GET _snapshot/everyday/_current

{
  "snapshots" : [
    {
      "snapshot" : "everyday-snapshot-2022-10-19_00-egvjkae-tjywfkpc6qyrng",
      "uuid" : "h-TpgzTLQ8GcxOWaPfBNZg",
      "version_id" : 7120099,
      "version" : "7.12.0",
      "indices" : [
        "url_info_v1",
        ".kibana_7.12.1_001",
        "stored_kol_media_v3",
        ".monitoring-es-7-2022.10.13",
        "witake_media_v5",
        ".kibana_task_manager_7.12.1_001",
        ".kibana_7.12.0_001",
        ".kibana-event-log-7.12.1-000001",
        "brand_info_v1",
        ".kibana-event-log-7.12.1-000002",
        ".monitoring-kibana-7-2022.10.13",
        ".monitoring-kibana-7-2022.10.15",
        ".monitoring-kibana-7-2022.10.14",
        ".kibana_task_manager_7.12.0_001",
        ".apm-custom-link",
        "app_rank_history_v2",
        ".ds-ilm-history-5-2022.08.26-000015",
        "app_info_history_v2",
        ".apm-agent-configuration",
        ".kibana-event-log-7.12.0-000015",
        ".kibana-event-log-7.12.0-000016",
        ".enrich-branding_lookup_policy-1666161026024",
        ".monitoring-kibana-7-2022.10.17",
        ".monitoring-kibana-7-2022.10.16",
        ".kibana-event-log-7.12.0-000013",
        ".monitoring-kibana-7-2022.10.19",
        ".monitoring-kibana-7-2022.10.18",
        ".kibana-event-log-7.12.0-000014",
        ".tasks",
        ".enrich-url_lookup_policy-1666146626034",
        ".ds-ilm-history-5-2022.07.27-000014",
        ".monitoring-es-7-2022.10.19",
        ".ds-ilm-history-5-2022.06.27-000013",
        ".monitoring-es-7-2022.10.18",
        ".monitoring-es-7-2022.10.17",
        ".monitoring-es-7-2022.10.16",
        "stored_kol_v8",
        "witake_media_v4",
        ".monitoring-es-7-2022.10.15",
        ".monitoring-es-7-2022.10.14",
        "stored_kol_v7",
        ".async-search",
        ".ds-ilm-history-5-2022.09.25-000016"
      ],
      "data_streams" : [
        "ilm-history-5"
      ],
      "include_global_state" : true,
      "metadata" : {
        "policy" : "everyday-snapshots"
      },
      "state" : "IN_PROGRESS",
      "start_time" : "2022-10-19T09:19:08.362Z",
      "start_time_in_millis" : 1666171148362,
      "end_time" : "1970-01-01T00:00:00.000Z",
      "end_time_in_millis" : 0,
      "duration_in_millis" : 0,
      "failures" : [ ],
      "shards" : {
        "total" : 0,
        "failed" : 0,
        "successful" : 0
      },
      "feature_states" : [
        {
          "feature_name" : "async_search",
          "indices" : [
            ".async-search"
          ]
        },
        {
          "feature_name" : "enrich",
          "indices" : [
            ".enrich-branding_lookup_policy-1666161026024",
            ".enrich-url_lookup_policy-1666146626034"
          ]
        },
        {
          "feature_name" : "kibana",
          "indices" : [
            ".kibana_7.12.1_001",
            ".kibana_task_manager_7.12.1_001",
            ".kibana_task_manager_7.12.0_001",
            ".kibana_7.12.0_001",
            ".apm-agent-configuration",
            ".apm-custom-link"
          ]
        },
        {
          "feature_name" : "tasks",
          "indices" : [
            ".tasks"
          ]
        }
      ]
    }
  ]
}

And the backup's progress: GET _snapshot/everyday/_status

{
  "snapshots" : [
    {
      "snapshot" : "everyday-snapshot-2022-10-19_00-egvjkae-tjywfkpc6qyrng",
      "repository" : "everyday",
      "uuid" : "h-TpgzTLQ8GcxOWaPfBNZg",
      "state" : "STARTED",
      "include_global_state" : true,
      "shards_stats" : {
        "initializing" : 0,
        "started" : 140,
        "finalizing" : 0,
        "done" : 39,
        "failed" : 0,
        "total" : 179
      },
      "stats" : {
        "incremental" : {
          "file_count" : 20518,
          "size_in_bytes" : 4598046889632
        },
        "processed" : {
          "file_count" : 1709,
          "size_in_bytes" : 83222638657
        },
        "total" : {
          "file_count" : 20518,
          "size_in_bytes" : 4598046889632
        },
        "start_time_in_millis" : 1666171148362,
        "time_in_millis" : 562577
      },
      "indices" : {
        ...
      }
    }
  ]
}

From overview stats like the ones above, you can see a single backup covering roughly 3+ TB of data:

      "stats" : {
        "incremental" : {
          "file_count" : 16367,
          "size_in_bytes" : 3470277692127
        },
        "processed" : {
          "file_count" : 11123,
          "size_in_bytes" : 2256919725081
        },
        "total" : {
          "file_count" : 16367,
          "size_in_bytes" : 3470277692127
        },
        "start_time_in_millis" : 1666324114415,
        "time_in_millis" : 13812493
      },

Since this was the first backup, incremental equals total.

retention: snapshot retention policy

Even with max_count set, the number of snapshots can temporarily exceed it, because cleanup is done by a retention task that by default runs at 01:30 UTC, i.e. 09:30 in the morning for us.

You can also run retention immediately: POST _slm/_execute_retention
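
The retention task's own schedule is a dynamic cluster setting; the cron below is the documented default (01:30 UTC), shown here as a sketch of how you would move it:

PUT _cluster/settings
{
  "persistent": {
    "slm.retention_schedule": "0 30 1 * * ?"
  }
}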

Incremental backups

Elasticsearch backups are incremental, because a backup is essentially a copy of segments. Segments are immutable, so the next backup only needs to copy the segments added since the last one. Very convenient.

To back up an index, a snapshot makes a copy of the index’s segments and stores them in the snapshot repository. Since segments are immutable, the snapshot only needs to copy any new segments created since the repository’s last snapshot.

Deleting an old snapshot is therefore only a logical delete, since some of its segments may be shared with later snapshots. Only the segments used exclusively by that snapshot can actually be removed.

Each snapshot is also logically independent. When you delete a snapshot, Elasticsearch only deletes the segments used exclusively by that snapshot. Elasticsearch doesn’t delete segments used by other snapshots in the repository.

Presumably the second backup marks which segments from the first snapshot still exist in the second one, somewhat like reference marking: once nothing points at a segment anymore, it can be deleted.

The actual data layout on HDFS after backing up:

liuhaibo@zj068 /user/liuhaibo/overseas-elasticsearch-backup$ du -h every5/
10.5 K   every5/index-7
8        every5/index.latest
612.0 M  every5/indices
27.7 K   every5/meta-117HwMy8SLWT-wiaerB_yA.dat
27.9 K   every5/meta-F_IHNPNmTlSfxR3OfSmd2A.dat
27.9 K   every5/meta-Ga_7mub3RPmelY2iGqxikw.dat
27.9 K   every5/meta-HSC_MWVxQk6kf-je8EEr-Q.dat
27.5 K   every5/meta-KHd3Jj3cRNanWr606CW-jA.dat
27.9 K   every5/meta-LUq4r2iCTQqwVu63rnLOXQ.dat
27.9 K   every5/meta-NNVW_tAYSLOMIUaowJALWw.dat
27.9 K   every5/meta-eBLmQ_ttRsyC1Tnl_vjrHg.dat
735      every5/snap-117HwMy8SLWT-wiaerB_yA.dat
787      every5/snap-F_IHNPNmTlSfxR3OfSmd2A.dat
787      every5/snap-Ga_7mub3RPmelY2iGqxikw.dat
787      every5/snap-HSC_MWVxQk6kf-je8EEr-Q.dat
735      every5/snap-KHd3Jj3cRNanWr606CW-jA.dat
787      every5/snap-LUq4r2iCTQqwVu63rnLOXQ.dat
787      every5/snap-NNVW_tAYSLOMIUaowJALWw.dat
787      every5/snap-eBLmQ_ttRsyC1Tnl_vjrHg.dat

restore

Elasticsearch can restore all indices, or any subset, from a snapshot, but the index must not already exist in the cluster. So either delete the existing index, or rename during the restore:

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshots-restore-snapshot.html#restore-index-data-stream

Restore a regular index

Delete the index first, then use the restore feature:

DELETE witake_media

Look up the snapshot info, mainly to get the snapshot's name:

GET _snapshot/eadata/_all

Taking one of the snapshots as an example:

    {
      "snapshot" : "ha-every15mim-snap-2022-10-19_11-45-zjzcuoxttoogzakdakivsa",
      "uuid" : "3W4c715_QsaCs3IjG9NVkA",
      "version_id" : 7120099,
      "version" : "7.12.0",
      "indices" : [
        ".apm-custom-link",
        ".ds-ilm-history-5-2022.10.17-000001",
        ".ds-.slm-history-5-2022.10.17-000001",
        "hello-2022-10-17-16-08-00",
        "witake_media",
        "hello-2022-10-17_16-09-00",
        ".kibana-event-log-7.12.0-000001",
        ".apm-agent-configuration",
        ".kibana_7.12.0_001",
        ".tasks",
        ".kibana_task_manager_7.12.0_001"
      ],
      "data_streams" : [
        "ilm-history-5",
        ".slm-history-5"
      ],
      "include_global_state" : true,
      "metadata" : {
        "policy" : "every15min-snapshots"
      },
      "state" : "SUCCESS",
      "start_time" : "2022-10-19T03:44:59.933Z",
      "start_time_in_millis" : 1666151099933,
      "end_time" : "2022-10-19T03:45:02.734Z",
      "end_time_in_millis" : 1666151102734,
      "duration_in_millis" : 2801,
      "failures" : [ ],
      "shards" : {
        "total" : 11,
        "failed" : 0,
        "successful" : 11
      },
      "feature_states" : [
        {
          "feature_name" : "kibana",
          "indices" : [
            ".kibana_7.12.0_001",
            ".kibana_task_manager_7.12.0_001",
            ".apm-agent-configuration",
            ".apm-custom-link"
          ]
        },
        {
          "feature_name" : "tasks",
          "indices" : [
            ".tasks"
          ]
        }
      ]
    },

The snapshot is named ha-every15mim-snap-2022-10-19_11-45-zjzcuoxttoogzakdakivsa and contains quite a few indices, including system indices.

Restore only witake_media:

POST _snapshot/eadata/ha-every15mim-snap-2022-10-19_11-45-zjzcuoxttoogzakdakivsa/_restore
{
  "indices": "witake_media"
}

While the restore is in progress the shards are unavailable, and searches return an error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "no_shard_available_action_exception",
        "reason" : "[th077.corp.yodao.com][127.0.0.1:9300][indices:data/read/search[phase/query]]",
        "index_uuid" : "70FCXoYrQoG_RoOMEoVDYQ",
        "shard" : "0",
        "index" : "witake_media"
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "witake_media",
        "node" : "mqRkT59kSvmYFKM6Aa66DA",
        "reason" : {
          "type" : "no_shard_available_action_exception",
          "reason" : "[th077.corp.yodao.com][127.0.0.1:9300][indices:data/read/search[phase/query]]",
          "index_uuid" : "70FCXoYrQoG_RoOMEoVDYQ",
          "shard" : "0",
          "index" : "witake_media"
        }
      }
    ]
  },
  "status" : 503
}

You can check the index's recovery progress: GET witake_media/_recovery

{
  "witake_media" : {
    "shards" : [
      {
        "id" : 0,
        "type" : "SNAPSHOT",
        "stage" : "DONE",
        "primary" : true,
        "start_time_in_millis" : 1666152244278,
        "stop_time_in_millis" : 1666152288176,
        "total_time_in_millis" : 43898,
        "source" : {
          "repository" : "eadata",
          "snapshot" : "ha-every15mim-snap-2022-10-19_11-45-zjzcuoxttoogzakdakivsa",
          "version" : "7.12.0",
          "index" : "witake_media",
          "restoreUUID" : "XXTcWrCoRIKAmX8Ugpp7xA"
        },
        "target" : {
          "id" : "mqRkT59kSvmYFKM6Aa66DA",
          "host" : "127.0.0.1",
          "transport_address" : "127.0.0.1:9300",
          "ip" : "127.0.0.1",
          "name" : "th077.corp.yodao.com"
        },
        "index" : {
          "size" : {
            "total_in_bytes" : 1816877753,
            "reused_in_bytes" : 0,
            "recovered_in_bytes" : 1816877753,
            "percent" : "100.0%"
          },
          "files" : {
            "total" : 134,
            "reused" : 0,
            "recovered" : 134,
            "percent" : "100.0%"
          },
          "total_time_in_millis" : 43828,
          "source_throttle_time_in_millis" : 0,
          "target_throttle_time_in_millis" : 0
        },
        "translog" : {
          "recovered" : 0,
          "total" : 0,
          "percent" : "100.0%",
          "total_on_start" : 0,
          "total_time_in_millis" : 43
        },
        "verify_index" : {
          "check_index_time_in_millis" : 0,
          "total_time_in_millis" : 0
        }
      }
    ]
  }
}

Still, the nicest place to watch is the Kibana overview, where all the key information about the recovery is visualized.
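
From the command line, the _cat variant of the recovery API gives a more compact view of the same information:

GET _cat/recovery/witake_media?v&h=index,shard,type,stage,files_percent,bytes_percent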

Once restored, the index is usable again.

You can also rename while restoring:

POST _snapshot/everyday/everyday-snapshot-2022-10-20_11-cwqt4ttkraqdn0uzj4cuvq/_restore
{
  "indices": "witake_media_v4",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}

This restores witake_media_v4 as restored-witake_media_v4.

Restore snapshot API: https://www.elastic.co/guide/en/elasticsearch/reference/current/restore-snapshot-api.html

Restore system indices

System indices can be restored as well. For instance, the feature_states of the snapshot above list the features it contains and the system indices each feature involves:

  • kibana
  • tasks

You can restore the system indices belonging to those features:

POST _snapshot/eadata/ha-every15mim-snap-2022-10-19_11-45-zjzcuoxttoogzakdakivsa/_restore
{
  "feature_states": [ "tasks" ]
}

Restore an entire cluster

What if the entire Elasticsearch cluster fails catastrophically...

  • https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshots-restore-snapshot.html#restore-entire-cluster

Elasticsearch is a distributed system to begin with; if you ever need to restore the whole cluster, it probably means one or more data centers went up in flames together.

Restore to a different cluster

More common than restoring an entire cluster is restoring to a different cluster, for example during a cluster migration.

When migrating a cluster, restoring data from a snapshot is much faster than reindexing, so it's a great fit for large data volumes. Better yet, the indices' mappings, settings, and aliases are all restored directly. Very convenient!

Before restoring, the repository has to be registered on the new cluster. Since that means multiple Elasticsearch clusters using the same backup repository, it's best to first register the repository as read-only on both the old and new clusters:

PUT _snapshot/everyday
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://ed-hdfs",
    "path": "/user/edop/overseas-elasticsearch-backup/everyday",
    "conf.dfs.nameservices": "ed-hdfs",
    "conf.dfs.ha.namenodes.ed-hdfs": "nn65,nn66",
    "conf.dfs.namenode.rpc-address.ed-hdfs.nn65": "ed-nn0:8000",
    "conf.dfs.namenode.rpc-address.ed-hdfs.nn66": "ed-nn1:8000",
    "conf.dfs.client.failover.proxy.provider.ed-hdfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "conf.dfs.client.read.shortcircuit": "true",
    "conf.dfs.domain.socket.path": "/var/run/hdfs-sockets/dn",
    "readonly": true
  }
}

Note the "readonly": true at the end.

Then simply start restoring indices.

Finally, don't forget to set readonly back to false on the new cluster.
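
Re-registering the repository with the same settings and the flag flipped should be enough; a sketch for the everyday repo above:

PUT _snapshot/everyday
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://ed-hdfs",
    "path": "/user/edop/overseas-elasticsearch-backup/everyday",
    "conf.dfs.nameservices": "ed-hdfs",
    "conf.dfs.ha.namenodes.ed-hdfs": "nn65,nn66",
    "conf.dfs.namenode.rpc-address.ed-hdfs.nn65": "ed-nn0:8000",
    "conf.dfs.namenode.rpc-address.ed-hdfs.nn66": "ed-nn1:8000",
    "conf.dfs.client.failover.proxy.provider.ed-hdfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "conf.dfs.client.read.shortcircuit": "true",
    "conf.dfs.domain.socket.path": "/var/run/hdfs-sockets/dn",
    "readonly": false
  }
}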

This post is licensed under CC BY 4.0 by the author.