This article looks at how HDFS Pipeline Recovery works.
When a client writes a file, the data is written sequentially in the form of blocks. HDFS further splits each block into packets and streams these packets to the DataNodes in the write pipeline, as shown in the figure below:
The write pipeline has three stages:
1)Pipeline setup. The client sends a Write_Block request along the pipeline and the last DataNode sends an acknowledgement back. After receiving the acknowledgement, the pipeline is ready for writing.
2)Data streaming. The data is sent through the pipeline in packets. The client buffers the data until a packet is filled up, and then sends the packet to the pipeline. If the client calls hflush(), then even if a packet is not full, it will nevertheless be sent to the pipeline and the next packet will not be sent until the acknowledgement of the previous hflush’ed packet is received by the client.
3)Close (finalize the replica and shutdown the pipeline). The client waits until all packets have been acknowledged and then sends a close request. All DataNodes in the pipeline change the corresponding replica into the FINALIZED state and report back to the NameNode. The NameNode then changes the block’s state to COMPLETE if at least the configured minimum replication number of DataNodes reported a FINALIZED state of their corresponding replicas.
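The buffering and hflush() behavior in the data streaming stage can be sketched as a toy client-side packet buffer. This is a simplified model with assumed names and a fixed packet size; the real logic lives inside the HDFS client (DFSOutputStream), not in a class like this:

```java
// Toy model of the client-side packet buffer in the data streaming stage.
// Bytes accumulate until a packet fills up; hflush() ships a partial packet.
class PacketBuffer {
    static final int PACKET_SIZE = 64 * 1024; // illustrative packet size
    private int buffered = 0;
    private int packetsSent = 0;

    // Buffer bytes; ship one packet each time the buffer fills up.
    void write(int numBytes) {
        buffered += numBytes;
        while (buffered >= PACKET_SIZE) {
            buffered -= PACKET_SIZE;
            packetsSent++;
        }
    }

    // hflush: ship the current packet even if it is not full.
    void hflush() {
        if (buffered > 0) {
            buffered = 0;
            packetsSent++;
        }
    }

    int packetsSent() { return packetsSent; }
}
```

The point of the model: a plain write only produces network traffic at packet boundaries, while hflush() forces whatever is buffered onto the pipeline immediately.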
Pipeline Recovery
When does pipeline recovery happen? It is initiated when one or more DataNodes in the pipeline encounter an error in any of the three stages above while a block is being written.
Mirroring the three stages above, pipeline recovery takes a different recovery action depending on the stage in which the failure occurred.
1)Recovery from Pipeline Setup Failure
- If the pipeline was created for a new block, the client abandons the block and asks the NameNode for a new block and a new list of DataNodes. The pipeline is reinitialized for the new block.
- If the pipeline was created to append to a block, the client rebuilds the pipeline with the remaining DataNodes and increments the block’s generation stamp.
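The two setup-failure paths above can be sketched as a simple decision. All names here are illustrative (the real recovery logic lives in the HDFS client's DataStreamer), and the new block/DataNode list from the NameNode is simulated:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the two setup-failure recovery paths.
class SetupRecovery {
    long genStamp = 1;                      // block's current generation stamp
    List<String> pipeline = new ArrayList<>();

    // isNewBlock: the pipeline was being set up for a brand-new block.
    // remaining: DataNodes that survived the failure.
    String recover(boolean isNewBlock, List<String> remaining) {
        if (isNewBlock) {
            // New-block case: abandon the block; a fresh block and
            // DataNode list come from the NameNode (simulated here).
            pipeline = new ArrayList<>(List.of("dnX", "dnY", "dnZ"));
            return "abandoned block, requested new block from NameNode";
        } else {
            // Append case: rebuild with the survivors and bump the GS.
            pipeline = new ArrayList<>(remaining);
            genStamp++;
            return "rebuilt pipeline with remaining DataNodes, bumped GS";
        }
    }
}
```

Note that only the append path bumps the generation stamp: for a brand-new block there is nothing on disk yet, so abandoning it is cheap.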
2)Recovery from Data Streaming Failure
- When a DataNode in the pipeline detects an error (for example, a checksum error or a failure to write to disk), that DataNode takes itself out of the pipeline by closing up all TCP/IP connections. If the data is deemed not corrupted, it also writes buffered data to the relevant block and checksum (METADATA) files.
- When the client detects the failure, it stops sending data to the pipeline, and reconstructs a new pipeline using the remaining good DataNodes. As a result, all replicas of the block are bumped up to a new GS.
- The client resumes sending data packets with this new GS. If the data sent has already been received by some of the DataNodes, they just ignore the packet and pass it downstream in the pipeline.
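The "ignore and pass downstream" behavior works because packets carry sequence numbers, making the receive idempotent. A toy model of that (hypothetical names; the real code is the DataNode's BlockReceiver):

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of a DataNode receiving packets during streaming recovery:
// packets it already stored are ignored but still forwarded downstream.
class ReplicaReceiver {
    private final Set<Long> storedSeqNos = new HashSet<>();
    int bytesWritten = 0;

    // Returns true if the packet was newly written, false if it was a
    // duplicate re-sent by the client after pipeline reconstruction.
    boolean receive(long seqNo, int len) {
        if (!storedSeqNos.add(seqNo)) {
            return false;           // already have it: ignore, just forward
        }
        bytesWritten += len;        // new data: write to block + meta files
        return true;
    }
}
```

This is why the client can safely re-send packets from its buffer after rebuilding the pipeline: DataNodes that already persisted a packet never write it twice.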
3)Recovery from Close Failure
- When the client detects a failure in the close state, it rebuilds the pipeline with the remaining DataNodes. Each DataNode bumps up the block’s GS and finalizes the replica if it’s not finalized yet.
Summary:
When a DataNode goes bad, it removes itself from the pipeline. During pipeline recovery, the client rebuilds the pipeline with the remaining DataNodes (it may or may not replace the bad DataNode with a new one, depending on the DataNode replacement policy). The replication monitor then takes care of re-replicating the block to satisfy the configured replication factor.
DataNode Replacement Policy upon Failure
There are four configurable policies regarding whether to add additional DataNodes to replace the bad ones when setting up a pipeline for recovery with the remaining DataNodes:
- DISABLE: Disables DataNode replacement and throws an error (at the server); this acts like NEVER at the client.
- NEVER: Never replace a DataNode when a pipeline fails (generally not a desirable action).
- DEFAULT: Replace based on the following conditions:
  a. Let r be the configured replication number.
  b. Let n be the number of existing replica DataNodes.
  c. Add a new DataNode only if r >= 3 and EITHER
     - floor(r/2) >= n; OR
     - r > n and the block is hflushed/appended.
- ALWAYS: Always add a new DataNode when an existing DataNode failed. This fails if a DataNode can’t be replaced.
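The DEFAULT conditions above can be written as a single predicate. The method name here is illustrative; in Hadoop the actual check belongs to the replace-datanode-on-failure policy machinery:

```java
// Sketch of the DEFAULT DataNode replacement condition.
// r = configured replication factor, n = existing replica DataNodes.
class ReplacementPolicy {
    static boolean shouldReplace(int r, int n, boolean hflushedOrAppended) {
        if (r < 3) {
            return false;           // small pipelines: never replace
        }
        // EITHER the survivors number at most half the target replication,
        // OR we are below target and the data is already visible to readers
        // (the block was hflushed or is being appended).
        return (r / 2) >= n || (r > n && hflushedOrAppended);
    }
}
```

For the common r = 3 case this means: losing two of three DataNodes always triggers replacement, while losing one does so only for hflushed or appended blocks.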
Reference: