Elasticsearch 5.x Source Code Analysis (7): Some Small Details of Shard Allocation

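Last week I attended a session on ES given by the Dean, and it stayed with me for days. I became very curious about the node failover logic it covered, so over the past two days I set up a very simple scenario to see how ES handles it. I read quite a few articles and dug through a fair amount of code, and learned enough to be worth writing down.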

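This post looks at the following very simple question:

[Figure: a primary shard failing while documents are being indexed]

The client first sends document A to Node1, which holds the primary shard, and the primary indexes it (for simplicity, assume requests go straight to the primary). Document A is successfully replicated to all replicas.
The client then sends document B to Node1, which indexes it and starts replicating it to Node2 and Node3.
Suppose the link from Node1 to Node3 times out, and at that moment Node1 goes down.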

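As everyone knows, the cluster turns red at this point because the primary of shard0 is gone. Which of the following two things does ES do?

Pick an arbitrary replica and promote it to primary; if its data is not the latest, the replicas first sync with each other (during the INITIALIZING phase).
Work out that R0 on Node2 holds the latest data, and promote Node2's R0 to primary directly.

The answer: on Elasticsearch 5.x it is the latter (I have not read the code of other versions, so I cannot speak for them). ES marks the current state RED, promotes Node2's R0 to primary, and marks the remaining R0 as UNASSIGNED. When Node1 rejoins the cluster, that unassigned R0 is allocated to Node1.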
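To make the expected outcome concrete before reading the real code, here is a minimal toy model of the scenario. It is only a sketch: the class InSyncFailoverSketch, its methods and the allocation ID strings are all hypothetical names invented for illustration, and it assumes that promotion only ever considers copies still tracked as in-sync.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model of the scenario above. Every name here is hypothetical, not an Elasticsearch API.
final class InSyncFailoverSketch {
    // Allocation IDs currently treated as holding the latest data.
    private final Set<String> inSync = new LinkedHashSet<>();
    // Copies that have failed and can no longer be promoted.
    private final Set<String> failed = new HashSet<>();
    private String primary;

    InSyncFailoverSketch(String primary, String... replicas) {
        this.primary = primary;
        inSync.add(primary);
        inSync.addAll(Arrays.asList(replicas));
    }

    // A write reached the primary but not this copy, so the copy is now stale.
    void replicaMissedWrite(String allocationId) {
        inSync.remove(allocationId);
    }

    // The primary failed: promote the first live copy that is still in-sync.
    // Returns the new primary, or null when no such copy is left.
    String failPrimary() {
        if (primary != null) {
            failed.add(primary);
        }
        primary = null;
        for (String candidate : inSync) {
            if (!failed.contains(candidate)) {
                primary = candidate;
                break;
            }
        }
        return primary;
    }

    public static void main(String[] args) {
        InSyncFailoverSketch shard0 = new InSyncFailoverSketch("node1-P0", "node3-R0", "node2-R0");
        shard0.replicaMissedWrite("node3-R0");    // document B never reached Node3
        System.out.println(shard0.failPrimary()); // the copy on Node2 is promoted
    }
}
```

Note that Node3's copy is listed before Node2's, yet it is skipped: in this model the choice is driven by in-sync membership, not by order.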

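Let's walk through the code.

Background

First, two posts by Jasper that describe the ES Gateway and Allocation modules in detail. This post simply follows Jasper's line of thought and reads through the rest of the code.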

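elasticsearch Source Code Analysis: the Gateway Module
elasticsearch Source Code Analysis: the Allocation Module

After reading them you should have a good idea of when ES runs an allocation. Here is a quick summary of the classes that come up throughout: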

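ShardStateAction: the entry point for the logic that reacts to shard state change events.
AllocationService: the main allocation class. It answers the question of how an index's shards should be distributed across nodes, and it wraps RoutingNodes, GatewayAllocator and so on. The method we care about here is applyFailedShards().
Allocator: has many subclasses. Each concrete subclass uses makeAllocationDecision() to apply its policy and produce a Decision on whether to allocate. For example, PrimaryShardAllocator decides whether a primary shard should be placed on a given node, and the Decision class carries an enumerated set of possible results.
RoutingNodes: in the words of the source, it represents the information in the clusterState object, and it exposes the properties needed while handling the current clusterState change, such as nodesToShards, unassignedShards and assignedShards.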
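The idea of an enumerated Decision can be sketched as follows. This is an illustration under assumptions, not the real class: DecisionSketch is a hypothetical name, the three values YES, THROTTLE and NO are how I remember the decision types, and the combination rule (any NO vetoes, otherwise any THROTTLE delays) is my reading of how several verdicts would be merged.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: DecisionSketch and its members are illustration only.
final class DecisionSketch {
    enum Type { YES, THROTTLE, NO }

    // Merge several verdicts into one: any NO vetoes the allocation,
    // otherwise any THROTTLE postpones it, otherwise the answer is YES.
    static Type combine(List<Type> verdicts) {
        Type result = Type.YES;
        for (Type verdict : verdicts) {
            if (verdict == Type.NO) {
                return Type.NO;
            }
            if (verdict == Type.THROTTLE) {
                result = Type.THROTTLE;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(combine(Arrays.asList(Type.YES, Type.THROTTLE, Type.YES)));
    }
}
```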

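Source Code Walkthrough

From Jasper's posts you should already know how a cluster allocates all the shards of an index at initialization. I will pick up from my own scenario: what happens when P0 on Node1 goes down.
Start from the top-level entry point:

ShardStateAction::ShardFailedClusterStateTaskExecutor

Its logic lives in execute(). One part of it first sorts the shards in the current ClusterState into categories: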

ShardRouting matched = currentState.getRoutingTable().getByAllocationId(task.shardId, task.allocationId);
if (matched == null) {
    Set<String> inSyncAllocationIds = indexMetaData.inSyncAllocationIds(task.shardId.id());
    // mark shard copies without routing entries that are in in-sync allocations set only as stale if the reason why
    // they were failed is because a write made it into the primary but not to this copy (which corresponds to
    // the check "primaryTerm > 0").
    if (task.primaryTerm > 0 && inSyncAllocationIds.contains(task.allocationId)) {
        logger.debug("{} marking shard {} as stale (shard failed task: [{}])", task.shardId, task.allocationId, task);
        tasksToBeApplied.add(task);
        staleShardsToBeApplied.add(new StaleShard(task.shardId, task.allocationId));
    } else {
        // tasks that correspond to non-existent shards are marked as successful
        logger.debug("{} ignoring shard failed task [{}] (shard does not exist anymore)", task.shardId, task);
        batchResultBuilder.success(task);
    }
} else {
    // failing a shard also possibly marks it as stale (see IndexMetaDataUpdater)
    logger.debug("{} failing shard {} (shard failed task: [{}])", task.shardId, matched, task);
    tasksToBeApplied.add(task);
    failedShardsToBeApplied.add(new FailedShard(matched, task.message, task.failure));
}

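The code above classifies all failed shards and picks out the copies that are already stale. These concepts are all new relative to ES < 5, so here is each one in turn:
AllocationId: assigned by the master when a shard copy of an index is allocated, it uniquely identifies one physical copy of the shard data. In code, getByAllocationId(shardId, allocationId) finds exactly one ShardRouting object. In this example the original P0 and the two R0 copies each have their own allocation ID, and the one belonging to Node1 ends up marked UNASSIGNED.
staleShards: shards identified as not holding the latest data. In ES 5 such a shard is never promoted to primary.
failedShards: shards that failed in the current clusterState change event.
indexMetaData.inSyncAllocationIds: in ES 5 the cluster state maintains this Set, and only the allocation IDs in it are considered to hold the latest data.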

Played out with my example: P0 on Node1 is the failed shard, and R0 on Node3 is the stale shard.
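The three branches of the excerpt above can be condensed into a small stand-alone function. This is a sketch only: ShardFailedClassifierSketch, Outcome and the plain string sets are hypothetical stand-ins for the routing table and the in-sync allocation ID set, chosen so the branch structure is easy to try out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the classification step; not Elasticsearch code.
final class ShardFailedClassifierSketch {
    enum Outcome { FAILED, STALE, IGNORED }

    // routedIds: allocation IDs that still have a routing entry.
    // inSyncIds: allocation IDs recorded as in-sync in the index metadata.
    static Outcome classify(String allocationId, long primaryTerm,
                            Set<String> routedIds, Set<String> inSyncIds) {
        if (routedIds.contains(allocationId)) {
            return Outcome.FAILED;   // the copy still has a routing entry
        }
        if (primaryTerm > 0 && inSyncIds.contains(allocationId)) {
            return Outcome.STALE;    // no routing entry, but still listed as in-sync
        }
        return Outcome.IGNORED;      // the shard no longer exists; the task counts as successful
    }

    public static void main(String[] args) {
        Set<String> routed = new HashSet<>(Arrays.asList("a1", "a2"));
        Set<String> inSync = new HashSet<>(Arrays.asList("a1", "a2", "a3"));
        System.out.println(classify("a1", 1, routed, inSync)); // FAILED
        System.out.println(classify("a3", 1, routed, inSync)); // STALE
        System.out.println(classify("a3", 0, routed, inSync)); // IGNORED
    }
}
```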

For more detail, see the Elasticsearch blog post below. It introduces this ES 5 feature, which maintains the set of IDs of the shard copies that hold the latest data.

Elasticsearch Internals - Tracking in-sync shard copies

(https://www.elastic.co/blog/tracking-in-sync-shard-copies)
Allocation IDs are assigned by the master during shard allocation and are stored on disk by the data nodes, right next to the actual shard data. The master is responsible for tracking the subset of copies that contain the most recent data. This set of copies, known as in-sync allocation IDs, is stored in the cluster state, which is persisted on all master and data nodes. Changes to the cluster state are backed by Elasticsearch’s consensus implementation, called zen discovery. This ensures that there is a shared understanding in the cluster as to which shard copies are considered as in-sync, implicitly marking those shard copies that are not in the in-sync set as stale.

Once the failed shards are classified, the allocationService method is called:

try {
    maybeUpdatedState = applyFailedShards(currentState, failedShardsToBeApplied, staleShardsToBeApplied);
    batchResultBuilder.successes(tasksToBeApplied);
}
// (the rest of the try statement is not shown in this excerpt)

ClusterState applyFailedShards(ClusterState currentState, List<FailedShard> failedShards, List<StaleShard> staleShards) {
    return allocationService.applyFailedShards(currentState, failedShards, staleShards);
}

The first thing allocationService.applyFailedShards() does is exclude all staleShards from routing. That produces a temporary clusterState, from which a RoutingNodes object is then built.

ClusterState tmpState = IndexMetaDataUpdater.removeStaleIdsWithoutRoutings(clusterState, staleShards);
RoutingNodes routingNodes = getMutableRoutingNodes(tmpState);
// shuffle the unassigned nodes, just so we won't have things like poison failed shards
routingNodes.unassigned().shuffle();
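The effect of the first line, deriving a temporary state whose in-sync set no longer lists the stale copies, can be sketched as a plain set difference. StaleIdRemovalSketch and removeStaleIds are hypothetical names, and the guard that keeps the set from becoming empty is an assumption of this sketch rather than something shown in the excerpt.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of dropping stale allocation IDs; not Elasticsearch code.
final class StaleIdRemovalSketch {
    // Returns a new in-sync set without the stale IDs; the input set is not modified.
    static Set<String> removeStaleIds(Set<String> inSyncIds, Collection<String> staleIds) {
        Set<String> remaining = new LinkedHashSet<>(inSyncIds);
        remaining.removeAll(staleIds);
        // Assumption of this sketch: never let the in-sync set become empty,
        // because then no copy could be trusted as a future primary.
        return remaining.isEmpty() ? inSyncIds : remaining;
    }

    public static void main(String[] args) {
        Set<String> inSync = new LinkedHashSet<>(Arrays.asList("node1-P0", "node2-R0", "node3-R0"));
        System.out.println(removeStaleIds(inSync, Arrays.asList("node3-R0")));
    }
}
```

Returning a fresh set instead of mutating the input mirrors the way the excerpt works on a temporary state rather than on the current one.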

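As mentioned above, RoutingNodes parses the clusterState once and computes the sets of assigned, unassigned, failed, stale and other shards, so its constructor is worth reading. Two things in it deserve attention: first, at this point the routingTable no longer contains the R0 shard on Node3; second, note at which point a shard is recorded as an assigned shard.

RoutingNodes constructor

Played out with my example: R0 on Node3 is not in the assignedShards list.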
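As a rough picture of what such a constructor has to do, the sketch below makes one pass over a flattened routing table and sorts every copy into per-node, per-shard and unassigned collections. RoutingNodesViewSketch and Copy are hypothetical types, and the sketch assumes that "assigned" simply means the copy has a node ID; relocation handling is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified view of a routing table; not the real RoutingNodes.
final class RoutingNodesViewSketch {
    static final class Copy {
        final String shardId;
        final String nodeId;   // null means the copy is unassigned
        final boolean primary;

        Copy(String shardId, String nodeId, boolean primary) {
            this.shardId = shardId;
            this.nodeId = nodeId;
            this.primary = primary;
        }
    }

    final Map<String, List<Copy>> nodesToShards = new LinkedHashMap<>();
    final Map<String, List<Copy>> assignedShards = new LinkedHashMap<>();
    final List<Copy> unassigned = new ArrayList<>();

    // One pass over the routing table: a copy counts as assigned only if it has a node.
    RoutingNodesViewSketch(List<Copy> routingTable) {
        for (Copy copy : routingTable) {
            if (copy.nodeId == null) {
                unassigned.add(copy);
                continue;
            }
            nodesToShards.computeIfAbsent(copy.nodeId, k -> new ArrayList<>()).add(copy);
            assignedShards.computeIfAbsent(copy.shardId, k -> new ArrayList<>()).add(copy);
        }
    }

    public static void main(String[] args) {
        RoutingNodesViewSketch view = new RoutingNodesViewSketch(Arrays.asList(
            new Copy("shard0", "node1", true),
            new Copy("shard0", "node2", false),
            new Copy("shard0", null, false)));
        System.out.println(view.assignedShards.get("shard0").size() + " assigned, "
            + view.unassigned.size() + " unassigned");
    }
}
```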

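Finally we reach the core method, failShard() of routingNodes. The earlier open questions were all answered in the methods above, so with those answers in hand this method is fairly easy to follow. Every part of it matters, so I am pasting it in full: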

   /**
     * Applies the relevant logic to handle a cancelled or failed shard.
     *
     * Moves the shard to unassigned or completely removes the shard (if relocation target).
     *
     * - If shard is a primary, this also fails initializing replicas.
     * - If shard is an active primary, this also promotes an active replica to primary (if such a replica exists).
     * - If shard is a relocating primary, this also removes the primary relocation target shard.
     * - If shard is a relocating replica, this promotes the replica relocation target to a full initializing replica, removing the
     *   relocation source information. This is possible as peer recovery is always done from the primary.
     * - If shard is a (primary or replica) relocation target, this also clears the relocation information on the source shard.
     *
     */
    public void failShard(Logger logger, ShardRouting failedShard, UnassignedInfo unassignedInfo, IndexMetaData indexMetaData,
                          RoutingChangesObserver routingChangesObserver) {
        ensureMutable();
        assert failedShard.assignedToNode() : "only assigned shards can be failed";
        assert indexMetaData.getIndex().equals(failedShard.index()) :
            "shard failed for unknown index (shard entry: " + failedShard + ")";
        assert getByAllocationId(failedShard.shardId(), failedShard.allocationId().getId()) == failedShard :
            "shard routing to fail does not exist in routing table, expected: " + failedShard + " but was: " +
                getByAllocationId(failedShard.shardId(), failedShard.allocationId().getId());

        logger.debug("{} failing shard {} with unassigned info ({})", failedShard.shardId(), failedShard, unassignedInfo.shortSummary());

        // if this is a primary, fail initializing replicas first (otherwise we move RoutingNodes into an inconsistent state)
        if (failedShard.primary()) {
            List<ShardRouting> assignedShards = assignedShards(failedShard.shardId());
            if (assignedShards.isEmpty() == false) {
                // copy list to prevent ConcurrentModificationException
                for (ShardRouting routing : new ArrayList<>(assignedShards)) {
                    if (!routing.primary() && routing.initializing()) {
                        // re-resolve replica as earlier iteration could have changed source/target of replica relocation
                        ShardRouting replicaShard = getByAllocationId(routing.shardId(), routing.allocationId().getId());
                        assert replicaShard != null : "failed to re-resolve " + routing + " when failing replicas";
                        UnassignedInfo primaryFailedUnassignedInfo = new UnassignedInfo(UnassignedInfo.Reason.PRIMARY_FAILED,
                            "primary failed while replica initializing", null, 0, unassignedInfo.getUnassignedTimeInNanos(),
                            unassignedInfo.getUnassignedTimeInMillis(), false, AllocationStatus.NO_ATTEMPT);
                        failShard(logger, replicaShard, primaryFailedUnassignedInfo, indexMetaData, routingChangesObserver);
                    }
                }
            }
        }

        if (failedShard.relocating()) {
            // find the shard that is initializing on the target node
            ShardRouting targetShard = getByAllocationId(failedShard.shardId(), failedShard.allocationId().getRelocationId());
            assert targetShard.isRelocationTargetOf(failedShard);
            if (failedShard.primary()) {
                logger.trace("{} is removed due to the failure/cancellation of the source shard", targetShard);
                // cancel and remove target shard
                remove(targetShard);
                routingChangesObserver.shardFailed(targetShard, unassignedInfo);
            } else {
                logger.trace("{}, relocation source failed / cancelled, mark as initializing without relocation source", targetShard);
                // promote to initializing shard without relocation source and ensure that removed relocation source
                // is not added back as unassigned shard
                removeRelocationSource(targetShard);
                routingChangesObserver.relocationSourceRemoved(targetShard);
            }
        }

        // fail actual shard
        if (failedShard.initializing()) {
            if (failedShard.relocatingNodeId() == null) {
                if (failedShard.primary()) {
                    // promote active replica to primary if active replica exists (only the case for shadow replicas)
                    ShardRouting activeReplica = activeReplicaWithHighestVersion(failedShard.shardId());
                    if (activeReplica == null) {
                        moveToUnassigned(failedShard, unassignedInfo);
                    } else {
                        movePrimaryToUnassignedAndDemoteToReplica(failedShard, unassignedInfo);
                        promoteReplicaToPrimary(activeReplica, indexMetaData, routingChangesObserver);
                    }
                } else {
                    // initializing shard that is not relocation target, just move to unassigned
                    moveToUnassigned(failedShard, unassignedInfo);
                }
            } else {
                // The shard is a target of a relocating shard. In that case we only need to remove the target shard and cancel the source
                // relocation. No shard is left unassigned
                logger.trace("{} is a relocation target, resolving source to cancel relocation ({})", failedShard,
                    unassignedInfo.shortSummary());
                ShardRouting sourceShard = getByAllocationId(failedShard.shardId(),
                    failedShard.allocationId().getRelocationId());
                assert sourceShard.isRelocationSourceOf(failedShard);
                logger.trace("{}, resolved source to [{}]. canceling relocation ... ({})", failedShard.shardId(), sourceShard,
                    unassignedInfo.shortSummary());
                cancelRelocation(sourceShard);
                remove(failedShard);
            }
            routingChangesObserver.shardFailed(failedShard, unassignedInfo);
        } else {
            assert failedShard.active();
            if (failedShard.primary()) {
                // promote active replica to primary if active replica exists
                ShardRouting activeReplica = activeReplicaWithHighestVersion(failedShard.shardId());
                if (activeReplica == null) {
                    moveToUnassigned(failedShard, unassignedInfo);
                } else {
                    movePrimaryToUnassignedAndDemoteToReplica(failedShard, unassignedInfo);
                    promoteReplicaToPrimary(activeReplica, indexMetaData, routingChangesObserver);
                }
            } else {
                assert failedShard.primary() == false;
                if (failedShard.relocating()) {
                    remove(failedShard);
                } else {
                    moveToUnassigned(failedShard, unassignedInfo);
                }
            }
            routingChangesObserver.shardFailed(failedShard, unassignedInfo);
        }
        assert node(failedShard.currentNodeId()).getByShardId(failedShard.shardId()) == null : "failedShard " + failedShard +
            " was matched but wasn't removed";
    }

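The branches above can be summarized as:

If the failed shard is a primary, every replica of that shardId that is still initializing is failed as well.
If the failed shard is an active primary, an active replica is promoted to primary.

This is where the answer lies. As noted earlier, the assignedShards in routingNodes no longer include the stale copies, so whichever active replica activeReplicaWithHighestVersion() returns is safe to promote.

If the failed shard is a relocating primary, the relocation target is cleaned up too (the source is gone, so the target is dropped as well).
If the failed shard is a relocating replica, the target is simply turned into a plain initializing replica and the source is removed.

These points are all straightforward, so I won't elaborate.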
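The promotion step for a failed active primary can be sketched like this. PromotionSketch, Replica and nodeVersion are hypothetical; in particular, interpreting "highest version" as the version of the node hosting the replica is my assumption based on the method name, not something the excerpt shows.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of choosing the replica to promote; not Elasticsearch code.
final class PromotionSketch {
    static final class Replica {
        final String allocationId;
        final boolean active;     // initializing replicas are not candidates
        final int nodeVersion;    // assumed meaning of "version": the hosting node's version

        Replica(String allocationId, boolean active, int nodeVersion) {
            this.allocationId = allocationId;
            this.active = active;
            this.nodeVersion = nodeVersion;
        }
    }

    // Returns the active replica with the highest version, or null if none is active,
    // in which case the failed primary would simply become unassigned.
    static Replica pickReplicaToPromote(List<Replica> assignedReplicas) {
        Replica best = null;
        for (Replica replica : assignedReplicas) {
            if (replica.active && (best == null || replica.nodeVersion > best.nodeVersion)) {
                best = replica;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Replica chosen = pickReplicaToPromote(Arrays.asList(
            new Replica("r-init", false, 9),
            new Replica("r-old", true, 5),
            new Replica("r-new", true, 6)));
        System.out.println(chosen.allocationId); // r-new
    }
}
```

The list passed in corresponds to the assigned copies only, which is why a stale copy never needs an explicit check here.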

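With that, the example and conclusion from the start of this post have been played out and hold up. One more remark: the reason some replicas hold the latest data and others do not comes from how ES indexes documents internally. When ES indexes a doc, the primary writes it first and then forwards the operation to the replicas concurrently and asynchronously, so at any instant some replicas may have applied a write that others have not. The guide linked below, written for older versions, describes the primary as requiring a quorum of shard copies before a write proceeds. In 5.x that check became wait_for_active_shards, and a replica that fails to apply a write is reported to the master and removed from the in-sync set, which is the mechanism the analysis above relies on.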

https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-write.html

That's all. Corrections and discussion are welcome.

Last edited: 2017.12.09 00:45:54
