HDFS Read and Write Mechanisms

This post explains how HDFS reads and writes files, and walks through the storage of one example file step by step in an easy-to-follow way.

1. Writing a File from the Client

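This section describes the events that take place between the client, the NameNode, and the DataNodes when a file is written to HDFS, and the order in which they occur.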

Suppose an HDFS client wants to write a 248 MB file named example.txt.

Assume the block size is configured as 128 MB (the default). The client will therefore split example.txt into two blocks: one of 128 MB (Block A) and one of 120 MB (Block B).
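The split above is plain integer arithmetic. The sketch below is only an illustration of that calculation; `split_into_blocks` is a hypothetical helper, not part of any Hadoop API.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would be split into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the bytes that are left over.
        sizes.append(remainder)
    return sizes

sizes = split_into_blocks(248 * MB)
print([s // MB for s in sizes])  # [128, 120]
```

Note that the last block is smaller than the configured block size; it occupies only as much space as the remaining data needs.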

The following protocol is followed whenever data is written to HDFS:

First, the HDFS client sends the NameNode a write request for the two blocks, Block A and Block B.
The NameNode then grants the client write permission and provides the IP addresses of the DataNodes to which the file blocks will eventually be copied.
The NameNode selects these DataNodes according to availability, the replication factor, and rack awareness; among the nodes that satisfy these constraints the choice is random.
Assume the replication factor is left at its default of 3. For each block, the NameNode then gives the client a list of three DataNode IP addresses. The list is unique for each block.
Suppose the NameNode provided the following lists of IP addresses to the client:
- For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}
- For Block B, list B = {IP of DataNode 3, IP of DataNode 7, IP of DataNode 9}
Each block will be copied to three different DataNodes so that the replication factor stays consistent throughout the cluster.
The whole data copy process then happens in three stages:

Set up of Pipeline
Data streaming and replication
Shutdown of Pipeline (Acknowledgement stage)
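Before looking at each stage, here is a rough sketch of how a per-block target list like list A or list B might be produced. It is a deliberately simplified assumption, not HDFS's actual placement policy: `choose_targets` and the `cluster` layout are hypothetical, and the only rules modelled are "live nodes only", "distinct nodes", and "span at least two racks when possible".

```python
import random

def choose_targets(datanodes, replication=3, rng=random):
    """Pick `replication` distinct live DataNodes, on at least two racks if possible.

    datanodes maps a node name to a (rack, is_alive) tuple.
    """
    live = [name for name, (_, alive) in datanodes.items() if alive]
    if len(live) < replication:
        raise RuntimeError("not enough live DataNodes for the replication factor")
    first = rng.choice(live)
    targets = [first]
    first_rack = datanodes[first][0]
    other_racks = [n for n in live if datanodes[n][0] != first_rack]
    if other_racks and replication > 1:
        # Put one replica on a different rack to survive a rack failure.
        targets.append(rng.choice(other_racks))
    remaining = [n for n in live if n not in targets]
    targets += rng.sample(remaining, replication - len(targets))
    return targets

# Hypothetical cluster: nine DataNodes spread over three racks, one node down.
cluster = {f"DataNode {i}": (f"rack{(i - 1) // 3 + 1}", i != 5) for i in range(1, 10)}
targets = choose_targets(cluster, rng=random.Random(0))
print(targets)
```

Passing a seeded `random.Random` makes the selection repeatable, which is convenient when experimenting with the sketch.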

1.1 Set up of Pipeline

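Before writing a block, the client confirms that the DataNodes in each IP list are ready to receive data. To do so, the client creates a pipeline for each block by connecting the DataNodes in that block's list. Consider Block A, for which the NameNode provided the list {DataNode 1, DataNode 4, DataNode 6}.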

So, for block A, the client will be performing the following steps to create a pipeline:

The client will choose the first DataNode in the list for Block A, which is DataNode 1, and will establish a TCP/IP connection to it.
The client will tell DataNode 1 to be ready to receive the block. It will also give DataNode 1 the IPs of the next two DataNodes (4 and 6), to which the block is to be replicated.
DataNode 1 will connect to DataNode 4, tell it to be ready to receive the block, and give it the IP of DataNode 6. DataNode 4 will then tell DataNode 6 to be ready to receive the data.
Next, the acknowledgements of readiness will follow the reverse sequence, from DataNode 6 to 4 and then to 1.
Finally, DataNode 1 will inform the client that all the DataNodes are ready, and a pipeline will be formed between the client and DataNodes 1, 4, and 6.
Pipeline setup is now complete, and the client can begin the data copy (streaming) process.
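The setup handshake above can be modelled as a request that travels down the chain while the readiness acknowledgements come back in reverse. This is a minimal in-memory sketch with no networking; `setup_pipeline` and the `ready` map are hypothetical names used only for illustration.

```python
def setup_pipeline(targets, ready):
    """Forward a setup request along `targets` and return the order of the acks.

    ready maps a DataNode name to True if it can accept the block.
    """
    acks = []

    def connect(index):
        node = targets[index]
        if not ready.get(node, False):
            raise ConnectionError(f"{node} is not ready to receive the block")
        if index + 1 < len(targets):
            # The node passes the request on to the next DataNode in the list.
            connect(index + 1)
        # A node acknowledges only after everything downstream has acknowledged.
        acks.append(node)

    if targets:
        connect(0)
    return acks

list_a = ["DataNode 1", "DataNode 4", "DataNode 6"]
acks = setup_pipeline(list_a, {node: True for node in list_a})
print(acks)  # ['DataNode 6', 'DataNode 4', 'DataNode 1']
```

In the sketch, a DataNode that is not ready raises `ConnectionError`, so the pipeline is never reported as complete to the client.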

1.2 Data Streaming

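Once the pipeline has been created, the client pushes data into it. Remember that HDFS replicates data according to the replication factor, so with a replication factor of 3, Block A will be stored on three DataNodes. The client itself copies Block A only to DataNode 1; the remaining replication is always carried out by the DataNodes, in sequence.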

So, the following steps will take place during replication:

Once the client has written the block to DataNode 1, DataNode 1 turns to DataNode 4, the next node in the pipeline.
DataNode 1 then pushes the block down the pipeline, and the data is copied to DataNode 4.
Likewise, DataNode 4 passes the block on to DataNode 6, which stores the last replica.
In practice the block travels as a stream of packets, so each DataNode forwards a packet as soon as it has stored it rather than waiting for the whole block.
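The replication steps above can be sketched as a loop that cuts the block into packets and hands each packet from one pipeline member to the next. This is an in-memory illustration only; `stream_block` is a hypothetical helper, and the 64 KB packet size is an assumption chosen for the example.

```python
def stream_block(block, pipeline, packet_size=64 * 1024):
    """Send `block` (bytes) through `pipeline` and return what each node stored."""
    stored = {node: bytearray() for node in pipeline}
    for offset in range(0, len(block), packet_size):
        packet = block[offset:offset + packet_size]
        # The client sends the packet to the first node only; every node
        # stores it and forwards it to its successor in the pipeline.
        for node in pipeline:
            stored[node] += packet
    return {node: bytes(data) for node, data in stored.items()}

block_a = bytes(range(256)) * 1000  # 256,000 bytes of sample data
replicas = stream_block(block_a, ["DataNode 1", "DataNode 4", "DataNode 6"])
print({node: len(data) for node, data in replicas.items()})
```

The client only ever talks to the first DataNode here, which mirrors the point that replication is carried out by the DataNodes themselves.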

1.3 Shutdown of Pipeline or Acknowledgement stage

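Once the block has been copied to all three DataNodes, a series of acknowledgements takes place to assure the client and the NameNode that the data has been written successfully. The acknowledgements travel in the reverse order, from DataNode 6 to 4 and then to 1. DataNode 1 then pushes the three acknowledgements (including its own) into the pipeline and sends them to the client. The client informs the NameNode that the data has been written successfully, the NameNode updates its metadata, and the client closes the pipeline to end the TCP session.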

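The overall procedure is as follows:

The client calls the create method of the DistributedFileSystem object, which creates a file output stream (FSDataOutputStream) object.
The DistributedFileSystem object makes an RPC call to the cluster's NameNode, which creates a file entry in the HDFS namespace. At this point the entry has no blocks; the NameNode supplies, for each block, the addresses of the DataNodes it should be copied to.
The client starts writing to the DataNodes through the FSDataOutputStream object. Data is first placed in the stream's internal data queue, which is consumed by the DataStreamer; the DataStreamer asks the NameNode to allocate new blocks, and a suitable list of DataNodes is picked to store the replicas.
The DataStreamer streams the packets to the first DataNode in the list, which stores each packet and forwards it to the second DataNode; the second DataNode in turn forwards it to the third.
The DataNodes acknowledge that the transfer is complete, and the first DataNode finally notifies the client that the data was written successfully.
When it has finished writing to the file, the client calls the close method on the file output stream (FSDataOutputStream).
Closing the stream results in a complete call to the NameNode, signalling that the file was written successfully; the NameNode records the result in its edit log.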
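The data-queue part of this procedure can be sketched with two queues. The data queue comes from the description above; the ack queue (packets sent but not yet acknowledged) is an addition based on how the HDFS client is commonly described, and `run_data_streamer` is a hypothetical name. Everything here runs synchronously in memory, so it shows the bookkeeping rather than real streaming.

```python
from collections import deque

def run_data_streamer(packets, pipeline):
    """Drain the data queue through the pipeline; return (stored, acked seqnos)."""
    data_queue = deque(enumerate(packets))  # (seqno, payload) waiting to be sent
    ack_queue = deque()                     # sent, but not yet acknowledged
    stored = {node: [] for node in pipeline}
    acked = []
    while data_queue:
        seqno, payload = data_queue.popleft()
        ack_queue.append((seqno, payload))
        for node in pipeline:
            # Each DataNode stores the packet, then forwards it downstream.
            stored[node].append(payload)
        # All nodes now hold the packet, so its ack reaches the client and
        # the packet can be dropped from the ack queue.
        done_seqno, _ = ack_queue.popleft()
        acked.append(done_seqno)
    return stored, acked

stored, acked = run_data_streamer(
    [b"packet-0", b"packet-1", b"packet-2"],
    ["DataNode 1", "DataNode 4", "DataNode 6"],
)
print(acked)  # [0, 1, 2]
```

Because the sketch is synchronous, the ack queue never holds more than one packet; keeping unacknowledged packets in a separate queue is what would let a client resend them if a DataNode in the pipeline failed.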

2. Reading a File from the Client

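Reading uses the same example: the client now wants to read example.txt back from HDFS. This section describes the events that take place between the client, the NameNode, and the DataNodes during the read, and the order in which they occur.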

The following steps take place while reading the file:

The client will ask the NameNode for the block metadata of the file "example.txt".
The NameNode will return the list of DataNodes on which each block (Block A and Block B) is stored.
After that, the client will connect to the DataNodes where the blocks are stored.
The client reads the data from the DataNodes (Block A from DataNode 1 and Block B from DataNode 3). Separate readers can fetch different blocks in parallel, while a single input stream reads them in order.
Once the client has all the required file blocks, it will combine them to form the file.
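The read steps above can be sketched as a lookup followed by a reassembly. This is an in-memory model under stated assumptions: `read_file`, the `locations` list standing in for the NameNode's answer, and the `datanodes` map are all hypothetical. For simplicity the sketch reads the blocks one after another, and it falls back to the next replica in the list when a DataNode does not have the block.

```python
def read_file(block_locations, datanodes):
    """Fetch every block from the first replica that has it and join them.

    block_locations: ordered list of (block_id, [DataNode names]).
    datanodes: maps a DataNode name to a {block_id: bytes} dict.
    """
    parts = []
    for block_id, replicas in block_locations:
        for node in replicas:
            data = datanodes.get(node, {}).get(block_id)
            if data is not None:
                parts.append(data)
                break
        else:
            raise OSError(f"no replica of block {block_id} is available")
    # Blocks are concatenated in file order to rebuild the original content.
    return b"".join(parts)

locations = [
    ("A", ["DataNode 1", "DataNode 4", "DataNode 6"]),
    ("B", ["DataNode 3", "DataNode 7", "DataNode 9"]),
]
datanodes = {
    "DataNode 1": {"A": b"hello "},
    "DataNode 4": {"A": b"hello "},
    "DataNode 3": {"B": b"world"},
}
print(read_file(locations, datanodes))  # b'hello world'
```

If DataNode 1 were removed from `datanodes`, Block A would be served by DataNode 4 instead and the result would be unchanged.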

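The process can be summarized as follows:

The client makes an RPC call to the cluster's NameNode through the DistributedFileSystem object to obtain the block locations of the file.
The NameNode returns the list of DataNodes that store each block.
The client connects to the nearest DataNode in each list.
The client reads the data from the DataNodes.
Once the client has all the required blocks, it combines them to form the file.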

When serving a client's read request, HDFS uses rack awareness to select the replica closest to the client, which reduces read latency and bandwidth consumption.
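One way to picture "closest" is as a distance in a tree of data centers, racks, and nodes: the number of hops from each node up to their closest common ancestor. The sketch below assumes topology paths of the form /datacenter/rack/node; the path format, `distance`, and `nearest_replica` are illustrative assumptions, not Hadoop API calls.

```python
def distance(a, b):
    """Hops from a and b up to their closest common ancestor, added together."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)

def nearest_replica(client, replicas):
    """Return the replica location with the smallest distance to the client."""
    return min(replicas, key=lambda replica: distance(client, replica))

client = "/d1/r1/n1"
replicas = ["/d1/r2/n4", "/d1/r1/n2", "/d2/r3/n7"]
print([distance(client, r) for r in replicas])  # [4, 2, 6]
print(nearest_replica(client, replicas))        # /d1/r1/n2
```

With this measure a replica on the same node has distance 0, one on the same rack 2, one on another rack in the same data center 4, and one in another data center 6.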

3. References

https://www.edureka.co/blog/apache-hadoop-hdfs-architecture/

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html