Investigating Why Elasticsearch Deleted Index Data

Background

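A while ago I was helping a customer troubleshoot some ES issues. Their environment had ended up ingesting far more data than originally planned, so based on machine resource usage we decided to scale out the ES cluster: from 2 data nodes to 4 data nodes, plus a dedicated master. The cluster already held terabytes of data, and to keep the later cluster operations lightweight we decided to temporarily move the stored index data (the indices directory under each data node's storage path) to a temporary location, Dest, beforehand.

After the expansion was done, as a test we moved a small part of the index data out of Dest onto the data nodes of the new cluster and restarted the cluster. Because the container volume mapping was misconfigured, the analysis plugin on the data nodes failed, so none of the loaded indices were successfully assigned. We moved the index data back to Dest and fixed the volume mapping.

Then, by chance, I noticed through the _cat/indices API that all of those indices were listed as unassigned. I figured they were unassigned anyway and the data had already been copied out, so I casually ran DELETE * against the indices. At the time I believed that an index's data and its metaData were both stored in the index files, that a data node would read them in when loading the data and report them to the master, and that the master would then update the global cluster state. So I did not expect DELETE * to cause any trouble, especially since it only removed indices that had never been allocated properly.

Next I copied the same portion of index data back to the data nodes and restarted the whole cluster. This is when the serious problem showed up. The _cluster/health API reported no index recovery percentage, which felt odd. I immediately called _cat/indices, and it returned no indices at all. Finally I checked the storage path of each data node and found that the index directories I had just copied over no longer existed.

At this point I started to panic: part of the data was gone, and what had happened was beyond my understanding of ES. After carefully cleaning up the customer's environment, I knew this problem deserved a proper investigation. This document records my study of the internal ES mechanism behind it.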

Practice and Analysis

ES 5.6.16
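1 master + 1 data (each ES instance run from source in IntelliJ IDEA)
At first I had no clear idea which ES module or class to start from, so I decided to build an ES environment, reproduce the problem, and look for an entry point from there. I set up a two-node cluster with 1 master and 1 data node, set both to the debug log level, replayed the whole sequence that led to the data being deleted, and tried to dig useful information out of the debug logs: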

[2020-10-09T13:48:48,538][DEBUG][o.e.i.c.IndicesClusterStateService] [master] [[twitter/4fHvcKLSRBuXK4mGTVI9Bg]] cleaning index, no longer part of the metadata

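The log line above, which records the deletion of index data, appeared in the debug logs of both the master node and the data node. The IndicesClusterStateService class and its deleteIndices(...) method are therefore the focus and entry point of this investigation. The complete body of deleteIndices(...) is as follows: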

/**
 * Deletes indices (with shard data).
 *
 * @param event cluster change event
 */
private void deleteIndices(final ClusterChangedEvent event) {
    final ClusterState previousState = event.previousState();
    final ClusterState state = event.state();
    final String localNodeId = state.nodes().getLocalNodeId();
    assert localNodeId != null;

    for (Index index : event.indicesDeleted()) {
        if (logger.isDebugEnabled()) {
            logger.debug("[{}] cleaning index, no longer part of the metadata", index);
        }
        AllocatedIndex<? extends Shard> indexService = indicesService.indexService(index);
        final IndexSettings indexSettings;
        if (indexService != null) {
            indexSettings = indexService.getIndexSettings();
            indicesService.removeIndex(index, DELETED, "index no longer part of the metadata");
        } else if (previousState.metaData().hasIndex(index.getName())) {
            // The deleted index was part of the previous cluster state, but not loaded on the local node
            final IndexMetaData metaData = previousState.metaData().index(index);
            indexSettings = new IndexSettings(metaData, settings);
            indicesService.deleteUnassignedIndex("deleted index was not assigned to local node", metaData, state);
        } else {
            // The previous cluster state's metadata also does not contain the index,
            // which is what happens on node startup when an index was deleted while the
            // node was not part of the cluster.  In this case, try reading the index
            // metadata from disk.  If its not there, there is nothing to delete.
            // First, though, verify the precondition for applying this case by
            // asserting that the previous cluster state is not initialized/recovered.
            assert previousState.blocks().hasGlobalBlock(GatewayService.STATE_NOT_RECOVERED_BLOCK);
            final IndexMetaData metaData = indicesService.verifyIndexIsDeleted(index, event.state());
            if (metaData != null) {
                indexSettings = new IndexSettings(metaData, settings);
            } else {
                indexSettings = null;
            }
        }
        if (indexSettings != null) {
            threadPool.generic().execute(new AbstractRunnable() {
                @Override
                public void onFailure(Exception e) {
                    logger.warn(
                        (Supplier<?>) () -> new ParameterizedMessage("[{}] failed to complete pending deletion for index", index), e);
                }

                @Override
                protected void doRun() throws Exception {
                    try {
                        // we are waiting until we can lock the index / all shards on the node and then we ack the delete of the store
                        // to the master. If we can't acquire the locks here immediately there might be a shard of this index still
                        // holding on to the lock due to a "currently canceled recovery" or so. The shard will delete itself BEFORE the
                        // lock is released so it's guaranteed to be deleted by the time we get the lock
                        indicesService.processPendingDeletes(index, indexSettings, new TimeValue(30, TimeUnit.MINUTES));
                    } catch (LockObtainFailedException exc) {
                        logger.warn("[{}] failed to lock all shards for index - timed out after 30 seconds", index);
                    } catch (InterruptedException e) {
                        logger.warn("[{}] failed to lock all shards for index - interrupted", index);
                    }
                }
            });
        }
    }
}

The method takes a ClusterChangedEvent, which describes a change in the ES cluster state; such changes are mainly published by the master node to the other nodes. The for loop iterates over the result of event.indicesDeleted(), whose body is as follows:

/**
 * Returns the indices deleted in this event
 */
public List<Index> indicesDeleted() {
    if (previousState.blocks().hasGlobalBlock(GatewayService.STATE_NOT_RECOVERED_BLOCK)) {
        // working off of a non-initialized previous state, so use the tombstones for index deletions
        return indicesDeletedFromTombstones();
    } else {
        // examine the diffs in index metadata between the previous and new cluster states to get the deleted indices
        return indicesDeletedFromClusterState();
    }
}

private List<Index> indicesDeletedFromTombstones() {
    // We look at the full tombstones list to see which indices need to be deleted.  In the case of
    // a valid previous cluster state, indicesDeletedFromClusterState() will be used to get the deleted
    // list, so a diff doesn't make sense here.  When a node (re)joins the cluster, its possible for it
    // to re-process the same deletes or process deletes about indices it never knew about.  This is not
    // an issue because there are safeguards in place in the delete store operation in case the index
    // folder doesn't exist on the file system.
    List<IndexGraveyard.Tombstone> tombstones = state.metaData().indexGraveyard().getTombstones();
    return tombstones.stream().map(IndexGraveyard.Tombstone::getIndex).collect(Collectors.toList());
}

private List<Index> indicesDeletedFromClusterState() {
    // If the new cluster state has a new cluster UUID, the likely scenario is that a node was elected
    // master that has had its data directory wiped out, in which case we don't want to delete the indices and lose data;
    // rather we want to import them as dangling indices instead.  So we check here if the cluster UUID differs from the previous
    // cluster UUID, in which case, we don't want to delete indices that the master erroneously believes shouldn't exist.
    // See test DiscoveryWithServiceDisruptionsIT.testIndicesDeleted()
    // See discussion on https://github.com/elastic/elasticsearch/pull/9952 and
    // https://github.com/elastic/elasticsearch/issues/11665
    if (metaDataChanged() == false || isNewCluster()) {
        return Collections.emptyList();
    }
    List<Index> deleted = null;
    for (ObjectCursor<IndexMetaData> cursor : previousState.metaData().indices().values()) {
        IndexMetaData index = cursor.value;
        IndexMetaData current = state.metaData().index(index.getIndex());
        if (current == null) {
            if (deleted == null) {
                deleted = new ArrayList<>();
            }
            deleted.add(index.getIndex());
        }
    }
    return deleted == null ? Collections.<Index>emptyList() : deleted;
}

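In practice, in our restart scenario the event obtains the deleted indices mainly from the tombstones in the cluster metaData. This makes sense: the ES node has just restarted, so there is no recovered previous state to diff against, and the result cannot come from comparing the previous and current cluster metaData. The investigation now leads to the tombstones: what are they, and why can they be read from metaData.indexGraveyard? It turns out that every time a DELETE index operation is executed, the deleted index is recorded in the cluster metaData, in the following form (as returned by the _cluster/state API):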

"metadata": {
    "cluster_uuid": "kURWiZwNQ0-jmDqNIQOa9g",
    "templates": {},
    "indices": {},
    "index-graveyard": {  
      "tombstones": [
        {
          "index": {
            "index_name": "twitter",
            "index_uuid": "IR5DYQLLTJKKBGxgal63nQ"
          },
          "delete_date_in_millis": 1602208073269
        }
      ]
    }
  }

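ES also has a dedicated IndexGraveyard class to represent deleted indices; the name literally means a graveyard for indices. Here are the key terms in one place:

IndexGraveyard: the class that represents the set of deleted indices
tombstone: the record of a single deleted index
tombstones: the collection of deleted-index records; its size can be set with cluster.indices.tombstones.size and defaults to 500
dangling indices: indices whose state still exists on disk but which are not present in the cluster metaData (the operation described above falls into this category)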
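
The relationship between these terms can be sketched in a few lines of Java. This is a simplified model for illustration, not the real IndexGraveyard implementation: GraveyardSketch, Graveyard, and Tombstone are hypothetical names, and the sketch assumes that once the cap is exceeded the oldest tombstones are dropped first.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class GraveyardSketch {

    /** One deleted index, mirroring the fields of the JSON shown above. */
    static final class Tombstone {
        final String indexName;
        final String indexUuid;
        final long deleteDateInMillis;

        Tombstone(String indexName, String indexUuid, long deleteDateInMillis) {
            this.indexName = indexName;
            this.indexUuid = indexUuid;
            this.deleteDateInMillis = deleteDateInMillis;
        }
    }

    /** Bounded collection of tombstones (hypothetical stand-in for the graveyard). */
    static final class Graveyard {
        private final int maxTombstones;
        private final Deque<Tombstone> tombstones = new ArrayDeque<>();

        Graveyard(int maxTombstones) {
            this.maxTombstones = maxTombstones;
        }

        /** Records a deletion; assumed policy: drop the oldest entries beyond the cap. */
        void addTombstone(String indexName, String indexUuid, long deleteDateInMillis) {
            tombstones.addLast(new Tombstone(indexName, indexUuid, deleteDateInMillis));
            while (tombstones.size() > maxTombstones) {
                tombstones.removeFirst();
            }
        }

        boolean containsUuid(String indexUuid) {
            for (Tombstone t : tombstones) {
                if (t.indexUuid.equals(indexUuid)) {
                    return true;
                }
            }
            return false;
        }

        int size() {
            return tombstones.size();
        }
    }

    public static void main(String[] args) {
        Graveyard graveyard = new Graveyard(500); // default cap mentioned above
        graveyard.addTombstone("twitter", "IR5DYQLLTJKKBGxgal63nQ", 1602208073269L);
        System.out.println("tombstoned: " + graveyard.containsUuid("IR5DYQLLTJKKBGxgal63nQ"));
    }
}
```

The point of the model is that a tombstone is keyed by index name plus UUID, so index files carrying the same UUID are still considered deleted even if they are copied back later.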

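With this groundwork in place, I went on to study the persistent storage of the ES master node. Under the master's storage path there are two important files: one records the cluster metaData (global-x.st) and the other records information about the master node itself (node-x.st). Opening each of them in vim in hex mode gives: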

# global-1.st
00000000: 3fd7 6c17 0573 7461 7465 0000 0001 0000  ?.l..state......
00000010: 0001 3a29 0a05 fa88 6d65 7461 2d64 6174  ..:)....meta-dat
00000020: 61fa 8676 6572 7369 6f6e d08b 636c 7573  a..version..clus
00000030: 7465 725f 7575 6964 5542 7563 7a51 3365  ter_uuidUBuczQ3e
00000040: 6353 6757 6e61 7378 7465 476b 7636 6788  cSgWnasxteGkv6g.
00000050: 7465 6d70 6c61 7465 73fa fb8e 696e 6465  templates...inde
00000060: 782d 6772 6176 6579 6172 64fa 8974 6f6d  x-graveyard..tom
00000070: 6273 746f 6e65 73f8 fa84 696e 6465 78fa  bstones...index.
00000080: 8969 6e64 6578 5f6e 616d 6544 6e61 6d65  .index_nameDname
00000090: 7389 696e 6465 785f 7575 6964 5564 5857  s.index_uuidUdXW
000000a0: 4957 4878 7352 6d57 3575 5441 6274 3969  IWHxsRmW5uTAbt9i
000000b0: 6b65 77fb 9464 656c 6574 655f 6461 7465  kew..delete_date
000000c0: 5f69 6e5f 6d69 6c6c 6973 2501 3a43 0e1b  _in_millis%.:C..
000000d0: 1dae fbf9 fbfb fbc0 2893 e800 0000 0000  ........(.......
000000e0: 0000 0028 e8a7 b60a                      ...(....

# node-0.st
00000000: 3fd7 6c17 0573 7461 7465 0000 0001 0000  ?.l..state......
00000010: 0001 3a29 0a05 fa86 6e6f 6465 5f69 6455  ..:)....node_idU
00000020: 6857 5147 786b 3637 5342 2d4d 5575 3874  hWQGxk67SB-MUu8t
00000030: 6548 7173 4c51 fbc0 2893 e800 0000 0000  eHqsLQ..(.......
00000040: 0000 001e e8fb f70a

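The dumps show the main metaData content stored in these files. Each time the master node starts, it reads the local global-x.st file through the read method of the MetaDataStateFormat class and fills the metaData data structure from it. Likewise, whenever the cluster metaData changes, the master promptly writes the new content back to the local file. In other words, ES relies on local files to persist cluster-level metadata. The x in global-x and node-x is a generation number that goes up each time the file is rewritten, normally counting up from 0.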
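
To make the dump a little easier to read, its leading bytes can be decoded by hand. The sketch below is only an interpretation of the bytes shown above, not something verified against the ES source here: it assumes the file begins with a Lucene-style codec header (magic number, codec name, format version), followed by a content-type id and a body that starts with the SMILE marker ":)". StateFileHeaderSketch and describe are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class StateFileHeaderSketch {

    // The first 22 bytes of the global-1.st dump above.
    static final byte[] HEADER = {
        0x3f, (byte) 0xd7, 0x6c, 0x17,        // magic number
        0x05, 0x73, 0x74, 0x61, 0x74, 0x65,   // length 5 followed by "state"
        0x00, 0x00, 0x00, 0x01,               // format version
        0x00, 0x00, 0x00, 0x01,               // content-type id (assumed meaning)
        0x3a, 0x29, 0x0a, 0x05                // ":)\n" plus a flags byte (assumed SMILE header)
    };

    /** Decodes the assumed header layout into a readable one-line summary. */
    static String describe(byte[] bytes) {
        ByteBuffer in = ByteBuffer.wrap(bytes);   // ByteBuffer is big-endian by default
        int magic = in.getInt();
        int nameLength = in.get() & 0xff;         // a short name fits in one length byte
        byte[] name = new byte[nameLength];
        in.get(name);
        int formatVersion = in.getInt();
        int contentType = in.getInt();
        byte[] marker = new byte[2];
        in.get(marker);
        return String.format(
            "magic=0x%08x codec=%s version=%d contentType=%d bodyMarker=%s",
            magic,
            new String(name, StandardCharsets.UTF_8),
            formatVersion,
            contentType,
            new String(marker, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        // prints: magic=0x3fd76c17 codec=state version=1 contentType=1 bodyMarker=:)
        System.out.println(describe(HEADER));
    }
}
```

Both global-1.st and node-0.st start with the same 22 bytes, which is consistent with the two files sharing one on-disk state format and differing only in their payload.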

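This explains what the return value of event.indicesDeleted() means inside deleteIndices(...) in IndicesClusterStateService. Now back to the main logic of deleteIndices(...). Because our scenario is an ES node restart, indexService is null, and the metaData of the cluster's previousState does not contain the indices listed in the tombstones. Execution therefore falls into the last branch and calls indicesService.verifyIndexIsDeleted(...), shown below: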

/**
 * Verify that the contents on disk for the given index is deleted; if not, delete the contents.
 * This method assumes that an index is already deleted in the cluster state and/or explicitly
 * through index tombstones.
 * @param index {@code Index} to make sure its deleted from disk
 * @param clusterState {@code ClusterState} to ensure the index is not part of it
 * @return IndexMetaData for the index loaded from disk
 */
@Override
@Nullable
public IndexMetaData verifyIndexIsDeleted(final Index index, final ClusterState clusterState) {
    // this method should only be called when we know the index (name + uuid) is not part of the cluster state
    if (clusterState.metaData().index(index) != null) {
        throw new IllegalStateException("Cannot delete index [" + index + "], it is still part of the cluster state.");
    }
    if (nodeEnv.hasNodeFile() && FileSystemUtils.exists(nodeEnv.indexPaths(index))) {
        final IndexMetaData metaData;
        try {
            metaData = metaStateService.loadIndexState(index);
        } catch (Exception e) {
            logger.warn((Supplier<?>) () -> new ParameterizedMessage("[{}] failed to load state file from a stale deleted index, folders will be left on disk", index), e);
            return null;
        }
        final IndexSettings indexSettings = buildIndexSettings(metaData);
        try {
            deleteIndexStoreIfDeletionAllowed("stale deleted index", index, indexSettings, ALWAYS_TRUE);
        } catch (Exception e) {
            // we just warn about the exception here because if deleteIndexStoreIfDeletionAllowed
            // throws an exception, it gets added to the list of pending deletes to be tried again
            logger.warn((Supplier<?>) () -> new ParameterizedMessage("[{}] failed to delete index on disk", metaData.getIndex()), e);
        }
        return metaData;
    }
    return null;
}

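As its name suggests, this method verifies that an index has really been deleted, where "deleted" means that the index directory under the data node's local storage path has actually been removed. The method first checks, via nodeEnv.hasNodeFile(), that the nodePaths and locks of the current data node's nodeEnv object are not null. If both are non-null and the full path of the index exists locally (for example, /Users/tony/myelasticsearch/2-5.6.16/home/data/nodes/0/indices/YzNUXHEZQrqe2CVCVu0thg in my setup), execution enters the if block.

Inside the block, the method first obtains the metaData of the index through metaStateService.loadIndexState(...). Just as the master node reads the local global-x.st file to obtain the cluster metaData, this call uses the read(...) method of MetaDataStateFormat to read the state file in the index directory on disk. Next, buildIndexSettings(metaData) builds the indexSettings object from the metaData; behind the scenes it simply calls the IndexSettings constructor. Execution then enters deleteIndexStoreIfDeletionAllowed(...), and the deletion is ultimately carried out by the deleteIndexDirectoryUnderLock(...) method of the NodeEnvironment class, where IOUtils.rm(indexPaths) removes the directories of indices that are present in the tombstones.

This is the root cause of what we saw: after the restart, all of the index directories that had been copied back were gone because this code path really deleted them. With that, the internal mechanism by which the index files were deleted is clear. To summarize the flow: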

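Executing a DELETE index operation writes the deleted index into the tombstones collection in the cluster metaData, and the metaData is persisted in a local file on the master node (global-x.st)
When the master node starts, it reads that file from its local path and loads the cluster information into metaData
While the cluster state is being published from the master, each node verifies whether the indices in the tombstones have really been deleted (that is, whether the local index directory has been removed)
If the files of a tombstoned index still exist, they are deleted during this process
The data loss described above happened because DELETE was executed first, so those indices were already recorded as deleted in the metaData; the index files copied to the data nodes afterwards were therefore removed by ES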
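
The flow above can be replayed in miniature with plain Java and a temporary directory. This is a rough simulation under stated assumptions, not ES code: TombstoneReplaySketch, applyTombstones, and copiedIndexSurvivesRestart are hypothetical names, the tombstones are reduced to a set of index UUIDs, and the _state/state-1.st file is only a placeholder standing in for the real index files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TombstoneReplaySketch {

    /** Node startup: any tombstoned index whose directory still exists is removed. */
    static List<String> applyTombstones(Path indicesDir, Set<String> tombstonedUuids) {
        List<String> removed = new ArrayList<>();
        for (String uuid : tombstonedUuids) {
            Path indexDir = indicesDir.resolve(uuid);
            if (Files.exists(indexDir)) {
                deleteRecursively(indexDir);
                removed.add(uuid);
            }
        }
        return removed;
    }

    /** Deletes a directory tree, children before parents. */
    static void deleteRecursively(Path root) {
        try (Stream<Path> walk = Files.walk(root)) {
            List<Path> deepestFirst = walk.sorted(Comparator.reverseOrder())
                                          .collect(Collectors.toList());
            for (Path path : deepestFirst) {
                Files.delete(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Replays the incident; returns true if the copied-back directory survived. */
    static boolean copiedIndexSurvivesRestart() {
        try {
            Path indicesDir = Files.createTempDirectory("indices");
            Set<String> tombstones = new LinkedHashSet<>();
            String uuid = "IR5DYQLLTJKKBGxgal63nQ";

            // Step 1: DELETE records the index UUID as a tombstone.
            tombstones.add(uuid);

            // Step 2: the old index directory is copied back onto the data node.
            Path stateDir = Files.createDirectories(indicesDir.resolve(uuid).resolve("_state"));
            Files.write(stateDir.resolve("state-1.st"), new byte[] {1}); // placeholder content

            // Step 3: restart; tombstones are applied against what is on disk.
            applyTombstones(indicesDir, tombstones);

            boolean survived = Files.exists(indicesDir.resolve(uuid));
            deleteRecursively(indicesDir); // clean up the temp directory
            return survived;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("copied index survived restart: " + copiedIndexSurvivesRestart());
    }
}
```

In this model the copied directory never survives, because its name matches a UUID that was tombstoned before the copy, which mirrors the behavior observed in the real cluster.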

Summary

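That completes the analysis, combining practice and code, of why the index data was deleted. Because my grasp of this part of ES was not precise enough, part of the data was lost during the operation. I took two lessons from it:

Make a complete backup before operating on data (use cp rather than mv)
Only go ahead with an operation once you sufficiently understand what is behind the feature you are using

This document does not go further into ES cluster state publishing or the details of the files ES stores locally. I found two good blog posts on those topics online and recommend reading them. As always, let's learn ES and improve together.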
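
As a concrete form of the second lesson, a small pre-check could be run before copying index directories back. The sketch below is hypothetical and deliberately minimal: it assumes you have already collected the index_uuid values from the tombstones in _cluster/state and the directory names under the backup's indices folder (which, as the path shown earlier suggests, are index UUIDs). RestoreCheckSketch and atRisk are illustrative names only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RestoreCheckSketch {

    /** Returns the directory names that match a tombstone and would be removed on startup. */
    static List<String> atRisk(Collection<String> directoryNames, Set<String> tombstonedUuids) {
        List<String> result = new ArrayList<>();
        for (String name : directoryNames) {
            if (tombstonedUuids.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // UUIDs taken from the examples earlier in this document.
        Set<String> tombstones = new HashSet<>(Arrays.asList("IR5DYQLLTJKKBGxgal63nQ"));
        List<String> backupDirs = Arrays.asList("IR5DYQLLTJKKBGxgal63nQ", "YzNUXHEZQrqe2CVCVu0thg");
        System.out.println("would be deleted on restart: " + atRisk(backupDirs, tombstones));
    }
}
```

Any directory this check reports should be kept in the backup location rather than moved onto a data node.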
