Chandy-Lamport Algorithm

“The distributed snapshot algorithm described here came about when I visited Chandy, who was then at the University of Texas in Austin. He posed the problem to me over dinner, but we had both had too much wine to think about it right then. The next morning, in the shower, I came up with the solution. When I arrived at Chandy's office, he was waiting for me with the same solution.”

Distributed Snapshots

// Behavior of process p: p initiates the snapshot by sending a Marker to q along channel c

begin
    p records its state;
then
    send one Marker along c after p records its state and before p sends further messages along c
end

// Behavior of process q on receiving the Marker: q records its own state and the state of channel c

if q has not recorded its state then
    begin
        q records its state;
        q records the state of c as the empty sequence
    end
else q records the state of c as the sequence of messages received along c after q’s state was recorded and before q received the marker along c.
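The two rules above can be tried out in a small simulation. The following is only a minimal sketch, assuming two processes that exchange integer amounts over FIFO channels; the `Process` class, the `MARKER` sentinel, and the "balance" state are hypothetical names chosen for illustration and do not come from the original paper.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical sentinel standing in for the marker message


class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance       # the local state being snapshotted
        self.out = {}                # peer name -> outgoing FIFO channel
        self.inc = {}                # peer name -> incoming FIFO channel
        self.recorded_state = None   # local state captured by the snapshot
        self.channel_state = {}      # peer name -> messages recorded for that incoming channel
        self.recording = set()       # incoming channels still being recorded

    def send(self, peer, amount):
        self.balance -= amount
        self.out[peer].append(amount)

    def record_and_send_markers(self):
        # Marker-sending rule: record own state, then put a marker on every
        # outgoing channel before any further message is sent on it.
        self.recorded_state = self.balance
        self.recording = set(self.inc)
        self.channel_state = {peer: [] for peer in self.inc}
        for channel in self.out.values():
            channel.append(MARKER)

    def receive(self, peer):
        msg = self.inc[peer].popleft()
        if msg == MARKER:
            # Marker-receiving rule: on the first marker, record state now, so
            # the channel it arrived on is recorded as the empty sequence.
            if self.recorded_state is None:
                self.record_and_send_markers()
            self.recording.discard(peer)  # stop recording this channel
        else:
            self.balance += msg
            if peer in self.recording:
                # Received after recording state but before the marker on this channel.
                self.channel_state[peer].append(msg)


def connect(a, b):
    ab, ba = deque(), deque()
    a.out[b.name], b.inc[a.name] = ab, ab
    b.out[a.name], a.inc[b.name] = ba, ba


def run_snapshot_demo():
    p, q = Process("p", 100), Process("q", 50)
    connect(p, q)
    q.send("p", 10)               # in flight on q->p when the snapshot starts
    p.record_and_send_markers()   # p initiates the snapshot
    p.send("q", 20)               # sent after the marker, so not part of the snapshot
    q.receive("p")                # marker: q records 40 and channel p->q as empty
    q.receive("p")                # the 20 arrives after q's recording of p->q ended
    p.receive("q")                # the 10 is recorded as the state of channel q->p
    p.receive("q")                # marker: p stops recording q->p
    return p, q


if __name__ == "__main__":
    p, q = run_snapshot_demo()
    print(p.recorded_state, q.recorded_state, p.channel_state, q.channel_state)
    # prints: 100 40 {'q': [10]} {'p': []}
```

In this run the recorded process states plus the recorded channel states add up to 150, the same total the system started with, even though no process was paused while the snapshot was taken.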

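Flink's distributed checkpointing is implemented with the Asynchronous Barrier Snapshots (ABS) algorithm. ABS borrows the main ideas of the Chandy-Lamport algorithm and makes several improvements, which are described in detail in the paper "Lightweight Asynchronous Snapshots for Distributed Dataflows". With that paper as a guide, let's look at the concrete implementation.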

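Flink's streaming computation model contains three types of nodes: Source Operator, Transformation Operators, and Sink Operator, as shown in the figure below. They are responsible for data input, processing, and output respectively, and correspond to the start, the intermediate nodes, and the end of the computation topology. An introduction to the computation model is not our focus here; for details, see the Concepts section of the official documentation.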

Asynchronous Barrier Snapshots

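Barriers are periodically injected into all Sources. When a Source node sees a Barrier, it immediately records its own state and then sends the Barrier on to the Transformation Operators.

A Sink handles Barriers the same way a Transformation Operator does. Once all Barriers have reached the Sinks and all Sinks have completed their checkpoints, this round of the snapshot is complete.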

// Initialize the operator

upon event (Init | input_channels, output_channels, fun, init_state)
do
    state := init_state;
    blocked_inputs := {};
    inputs := input_channels;
    outputs := output_channels;
    udf := fun;

// Behavior on receiving a Barrier

upon event (receive | input, (barrier))
do
    // Add the current input channel to the blocked set and block it; message processing on this channel pauses
    if input != Nil then
        blocked_inputs := blocked_inputs ∪ {input};
        trigger (block | input);
    // If every channel is blocked, all barriers have been received
    if blocked_inputs = inputs then
        blocked_inputs := {};
        broadcast (send | outputs, (barrier)); // send the Barrier to all outputs
        trigger (snapshot | state); // record this node's current state
        for each inputs as input // unblock all channels and resume processing
            trigger (unblock | input);
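The barrier-handling pseudocode above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not Flink's implementation: the `AligningOperator` class, the `BARRIER` sentinel, and the running-sum state are hypothetical, and "blocking" a channel is modeled by buffering its messages until alignment completes.

```python
from collections import deque

BARRIER = "BARRIER"  # hypothetical sentinel standing in for a checkpoint barrier


class AligningOperator:
    """Keeps a running sum and aligns barriers across all inputs before snapshotting."""

    def __init__(self, inputs):
        self.inputs = set(inputs)
        self.blocked = set()                            # blocked_inputs in the pseudocode
        self.buffered = {ch: deque() for ch in inputs}  # messages held back while blocked
        self.state = 0
        self.snapshots = []                             # one recorded state per checkpoint
        self.emitted = []                               # stands in for the output channels

    def receive(self, channel, msg):
        if channel in self.blocked:
            # The channel is blocked: defer the message until alignment completes.
            self.buffered[channel].append(msg)
            return
        if msg == BARRIER:
            self.blocked.add(channel)
            if self.blocked == self.inputs:
                # All barriers arrived: forward the barrier, snapshot, then unblock.
                self.emitted.append(BARRIER)
                self.snapshots.append(self.state)
                self._unblock_all()
        else:
            self.state += msg
            self.emitted.append(self.state)

    def _unblock_all(self):
        self.blocked = set()
        pending, self.buffered = self.buffered, {ch: deque() for ch in self.inputs}
        for channel, msgs in pending.items():
            for m in msgs:
                self.receive(channel, m)  # replay in per-channel arrival order


def run_alignment_demo():
    op = AligningOperator(["a", "b"])
    for channel, msg in [("a", 1), ("a", BARRIER), ("a", 5),
                         ("b", 2), ("b", BARRIER), ("b", 3)]:
        op.receive(channel, msg)
    return op


if __name__ == "__main__":
    op = run_alignment_demo()
    print(op.snapshots, op.state, op.emitted)
    # prints: [3] 11 [1, 3, 'BARRIER', 8, 11]
```

The record 5 arrives on channel a after that channel's barrier, so it is held back and only applied after the snapshot; the snapshot therefore contains exactly the pre-barrier records 1 and 2.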

When the alignment is skipped, an operator keeps processing all inputs, even after some checkpoint barriers for checkpoint n arrived. That way, the operator also processes elements that belong to checkpoint n+1 before the state snapshot for checkpoint n was taken. On a restore, these records will occur as duplicates, because they are both included in the state snapshot of checkpoint n, and will be replayed as part of the data after checkpoint n.
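The duplicate effect can be reproduced with a toy model. The sketch below assumes an operator whose state is a running sum over two input channels and a restore that replays every record following the barrier on each channel; the function names and the event list are hypothetical and only serve to illustrate the paragraph above.

```python
BARRIER = "BARRIER"  # hypothetical sentinel standing in for the barrier of checkpoint n


def take_snapshot(events, inputs, align):
    """Return the operator state captured once barriers from all inputs have arrived."""
    state = 0
    seen_barrier = set()
    for channel, msg in events:
        if msg == BARRIER:
            seen_barrier.add(channel)
            if seen_barrier == set(inputs):
                return state
        elif align and channel in seen_barrier:
            continue  # aligned: the channel is blocked, the record waits for the snapshot
        else:
            state += msg  # unaligned: post-barrier records leak into the snapshot
    raise ValueError("checkpoint never completed")


def records_after_barrier(events):
    """Records replayed on restore: everything that followed the barrier on its channel."""
    seen_barrier, replay = set(), []
    for channel, msg in events:
        if msg == BARRIER:
            seen_barrier.add(channel)
        elif channel in seen_barrier:
            replay.append(msg)
    return replay


def restored_state(events, inputs, align):
    return take_snapshot(events, inputs, align) + sum(records_after_barrier(events))


EVENTS = [("a", 1), ("a", BARRIER), ("a", 5), ("b", 2), ("b", BARRIER), ("b", 3)]

if __name__ == "__main__":
    print(restored_state(EVENTS, ["a", "b"], align=True))   # prints: 11
    print(restored_state(EVENTS, ["a", "b"], align=False))  # prints: 16
```

The four records sum to 11. Without alignment the record 5 is already inside the snapshot (8 instead of 3) and is then replayed again, so the restored state counts it twice.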

