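Introduction to ZooKeeper

ZooKeeper: A Distributed Coordination Service for Distributed Applications

ZooKeeper is an open-source coordination service for distributed applications. It exposes a small set of primitives on which distributed applications can build higher-level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program against and uses a data model styled after the familiar directory tree of a file system. It runs in Java and has bindings for both Java and C.

Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlocks. ZooKeeper was built to relieve distributed applications of the responsibility of implementing coordination services from scratch.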

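Design goals

ZooKeeper is simple. ZooKeeper lets distributed processes coordinate with each other through a shared hierarchical namespace that is organized much like a standard file system. The namespace consists of data registers, called znodes in ZooKeeper parlance, which are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper keeps its data in memory, which lets it achieve high throughput and low latency.
The ZooKeeper implementation puts a premium on high-performance, highly available, strictly ordered access. Its performance means it can be used in large distributed systems. Its reliability keeps it from being a single point of failure. Its strict ordering means that sophisticated synchronization primitives can be implemented at the client.
ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is meant to be replicated over a set of hosts called an ensemble.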

[Figure: ZooKeeper Service]

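The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of the state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service is available. Each client connects to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client connects to a different server.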
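The majority rule above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming that "majority" means strictly more than half of the ensemble; the function names are made up for illustration and are not part of any ZooKeeper API.

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of servers that is strictly more than half."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers may be down while a majority is still up."""
    return ensemble_size - majority(ensemble_size)


def is_available(ensemble_size: int, servers_up: int) -> bool:
    """True if the servers that are up still form a majority."""
    return servers_up >= majority(ensemble_size)
```

Under this assumption a 5-server ensemble stays available with 2 servers down, and adding a sixth server does not raise that number, which is one reason odd ensemble sizes are a common choice.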

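ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use this order to implement higher-level abstractions, such as synchronization primitives.
ZooKeeper is fast. It is especially fast in read-dominant workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.

Data model and the hierarchical namespace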

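The namespace provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's namespace is identified by a path.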
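To illustrate the naming scheme, here is a small sketch that splits a slash-separated path into its elements and finds a node's parent. The helper names and the example paths are hypothetical, and the validation is deliberately simpler than whatever rules a real ZooKeeper server applies.

```python
from typing import List


def split_path(path: str) -> List[str]:
    """Split an absolute path such as '/app1/p_1' into its elements."""
    if path == "/":
        return []  # the root has no path elements
    if not path.startswith("/") or path.endswith("/"):
        raise ValueError("path must start with '/' and must not end with '/'")
    parts = path[1:].split("/")
    if any(part == "" for part in parts):
        raise ValueError("empty path element in " + repr(path))
    return parts


def parent_of(path: str) -> str:
    """Return the path of the parent node; the root has no parent."""
    parts = split_path(path)
    if not parts:
        raise ValueError("the root node has no parent")
    return "/" + "/".join(parts[:-1])
```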

[Figure: ZooKeeper's Hierarchical Namespace]

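Nodes and ephemeral nodes

Unlike a standard file system, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, and so on, so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.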

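Znodes maintain a stat structure that includes version numbers for data changes and ACL changes, together with timestamps, to allow cache validation and coordinated updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data, it also receives the version of that data.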
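One way a version number enables coordinated updates is a compare-and-set: a write succeeds only if the writer names the version it last read. The toy class below is an in-memory sketch of that idea, not the ZooKeeper client API; the class name, the exception name, and the convention that -1 means "any version" are assumptions chosen for the example.

```python
class BadVersionError(Exception):
    """Raised when a write names a version that is no longer current."""


class ToyZnode:
    def __init__(self, data: bytes = b"") -> None:
        self.data = data
        self.version = 0  # incremented on every successful data change

    def get(self):
        """Return the data together with the version it belongs to."""
        return self.data, self.version

    def set(self, data: bytes, expected_version: int = -1) -> int:
        """Replace the data if expected_version matches (-1 skips the check)."""
        if expected_version != -1 and expected_version != self.version:
            raise BadVersionError(
                "expected %d, current %d" % (expected_version, self.version)
            )
        self.data = data
        self.version += 1
        return self.version
```

A client that reads `(data, version)`, computes a new value, and writes it back with that version learns from the error that someone else changed the node in between, and can then re-read and retry.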

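The data stored at each znode in a namespace is read and written atomically. A read gets all the data bytes associated with a znode, and a write replaces all of the data. Each node has an Access Control List (ACL) that restricts who can do what.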

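ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created them is active. When the session ends, the znode is deleted. Ephemeral nodes are useful when you want to implement [tbd].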
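The lifetime rule for ephemeral nodes can be sketched with a toy registry that remembers which session created each ephemeral node. This is only an in-memory illustration; the class, the method names, and the use of plain strings as session ids are all hypothetical.

```python
class ToyNamespace:
    def __init__(self) -> None:
        # path -> owning session id, or None for an ordinary node
        self._owner = {}

    def create(self, path: str, session_id: str, ephemeral: bool = False) -> None:
        if path in self._owner:
            raise KeyError("node already exists: " + path)
        self._owner[path] = session_id if ephemeral else None

    def exists(self, path: str) -> bool:
        return path in self._owner

    def close_session(self, session_id: str) -> None:
        """Delete every ephemeral node created by the closing session."""
        doomed = [p for p, owner in self._owner.items() if owner == session_id]
        for path in doomed:
            del self._owner[path]
```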

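Conditional updates and watches

ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch is triggered and removed when the znode changes. When a watch is triggered, the client receives a packet saying that the znode has changed. If the connection between the client and one of the ZooKeeper servers is broken, the client receives a local notification. These can be used to [tbd].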
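The "triggered and removed" behaviour means a watch fires at most once. The sketch below models that with a toy znode whose registered callbacks are cleared as they are notified; it is not the ZooKeeper client API, and the class name, the callback signature, and the "changed" event string are assumptions for the example.

```python
from typing import Callable, List, Optional


class WatchedZnode:
    def __init__(self, data: bytes = b"") -> None:
        self.data = data
        self._watchers: List[Callable[[str], None]] = []

    def get(self, watcher: Optional[Callable[[str], None]] = None) -> bytes:
        """Read the data and optionally leave a one-shot watch behind."""
        if watcher is not None:
            self._watchers.append(watcher)
        return self.data

    def set(self, data: bytes) -> None:
        """Change the data, then fire and drop every pending watch."""
        self.data = data
        pending, self._watchers = self._watchers, []
        for notify in pending:
            notify("changed")
```

A client that wants to keep following a node has to set a new watch when it reads the node again after being notified.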

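Guarantees

ZooKeeper is very fast and very simple. Because its goal is to serve as a basis for building more complicated services, such as synchronization, it provides a set of guarantees. These are:

Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. There are no partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.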

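Simple API

One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:

create: creates a node at a location in the tree
delete: deletes a node
exists: tests if a node exists at a location
get data: reads the data from a node
set data: writes data to a node
get children: retrieves a list of children of a node
sync: waits for data to be propagated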
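To show how these operations fit together, here is a toy single-process tree that offers the first six of them. It is a sketch for illustration only and not the real client library: the class, method, and exception names are hypothetical, sessions, ACLs, and versions are left out, and sync is omitted because a single in-memory copy has nothing to propagate.

```python
from typing import Dict, List


class NoNodeError(Exception):
    pass


class NodeExistsError(Exception):
    pass


class NotEmptyError(Exception):
    pass


class ToyTree:
    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {"/": b""}  # path -> node data

    def _require(self, path: str) -> None:
        if path not in self._data:
            raise NoNodeError(path)

    def create(self, path: str, data: bytes = b"") -> None:
        if path in self._data:
            raise NodeExistsError(path)
        self._require(path.rsplit("/", 1)[0] or "/")  # parent must exist
        self._data[path] = data

    def exists(self, path: str) -> bool:
        return path in self._data

    def get_data(self, path: str) -> bytes:
        self._require(path)
        return self._data[path]

    def set_data(self, path: str, data: bytes) -> None:
        self._require(path)
        self._data[path] = data  # a write replaces all of the data

    def get_children(self, path: str) -> List[str]:
        self._require(path)
        prefix = path.rstrip("/") + "/"
        names = [p[len(prefix):] for p in self._data
                 if p != path and p.startswith(prefix)]
        return sorted(n for n in names if "/" not in n)  # direct children only

    def delete(self, path: str) -> None:
        self._require(path)
        if self.get_children(path):
            raise NotEmptyError(path)
        del self._data[path]
```

In this toy, deleting a node that still has children is refused, so a caller removes the children first.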

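Implementation

The figure ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.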

[Figure: ZooKeeper Components]

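The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.

Every ZooKeeper server services clients. Each client connects to exactly one server to submit requests. Read requests are serviced from the local replica of each server's database. Requests that change the state of the service, that is, write requests, are processed by an agreement protocol.

As part of the agreement protocol, all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failure and syncing followers with leaders.

ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system will be when the write is applied and transforms the request into a transaction that captures this new state.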
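The last step, turning a request into a transaction that captures the new state, can be sketched as follows. This toy is an assumption-laden illustration rather than ZooKeeper's actual protocol: the names Leader, SetDataTxn, zxid, and apply_txn are chosen for the example, and the "agreement" among servers is reduced to handing the same transactions to each replica in order.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

State = Dict[str, Tuple[bytes, int]]  # path -> (data, version)


@dataclass(frozen=True)
class SetDataTxn:
    zxid: int      # position in the global order of transactions
    path: str
    data: bytes
    version: int   # the version the node has AFTER the write


def apply_txn(state: State, txn: SetDataTxn) -> None:
    """Install the recorded result; no condition is re-evaluated here."""
    state[txn.path] = (txn.data, txn.version)


class Leader:
    def __init__(self) -> None:
        self.state: State = {}
        self._next_zxid = 1

    def propose(self, path: str, data: bytes, expected_version: int) -> SetDataTxn:
        """Check a conditional write once, then emit the resulting state."""
        _, current = self.state.get(path, (b"", 0))
        if expected_version != current:
            raise ValueError("stale version for " + path)
        txn = SetDataTxn(self._next_zxid, path, data, current + 1)
        self._next_zxid += 1
        apply_txn(self.state, txn)
        return txn
```

Because the transaction in this sketch records the outcome rather than the request, a follower that applies the same transactions in the same order ends up with the same state as the leader, and applying one of them a second time changes nothing.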

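Uses

The programming interface to ZooKeeper is deliberately simple. With it, however, you can implement higher-order operations, such as synchronization primitives, group membership, ownership, and so on. Some distributed applications have used it to: [tbd: add uses from white paper and video presentation.]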

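Performance

ZooKeeper is designed to be highly performant. But is it? Results from the ZooKeeper development team at Yahoo! Research indicate that it is. Performance is especially high in applications where reads outnumber writes, since writes involve synchronizing the state of all servers.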

[Figure: ZooKeeper Throughput as the Read-Write Ratio Varies]

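Reliability

To show how the system behaves over time as failures are injected, we ran a ZooKeeper service made up of 7 machines. We ran the same saturation benchmark, but this time we kept the write percentage at a constant 30%, which is a conservative ratio for our expected workloads.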

[Figure: Reliability in the Presence of Errors]

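There are a few important observations from this graph. First, if followers fail and recover quickly, ZooKeeper is able to sustain high throughput despite the failure. Second, and maybe more importantly, the leader election algorithm lets the system recover fast enough to prevent throughput from dropping substantially. In our observations, ZooKeeper takes less than 200 ms to elect a new leader. Third, as followers recover, ZooKeeper is able to raise throughput again once they start processing requests.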

Original text

ZooKeeper
ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.
Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.
Design Goals
ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The namespace consists of data registers - called znodes, in ZooKeeper parlance - and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency numbers.
The ZooKeeper implementation puts a premium on high performance, highly available, strictly ordered access. The performance aspects of ZooKeeper mean it can be used in large, distributed systems. The reliability aspects keep it from being a single point of failure. The strict ordering means that sophisticated synchronization primitives can be implemented at the client.
ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.
ZooKeeper Service

The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client will connect to a different server.
ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions, such as synchronization primitives.
ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.
Data model and the hierarchical namespace
The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's name space is identified by a path.
ZooKeeper's Hierarchical Namespace

Nodes and ephemeral nodes
Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file-system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.
Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.
The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.
ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends, the znode is deleted. Ephemeral nodes are useful when you want to implement [tbd].
Conditional updates and watches
ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes. When a watch is triggered, the client receives a packet saying that the znode has changed. If the connection between the client and one of the ZooKeeper servers is broken, the client will receive a local notification. These can be used to [tbd].
Guarantees
ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.

Atomicity - Updates either succeed or fail. No partial results.

Single System Image - A client will see the same view of the service regardless of the server that it connects to.

Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.

Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.

For more information on these, and how they can be used, see [tbd]
Simple API
One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:
create: creates a node at a location in the tree
delete: deletes a node
exists: tests if a node exists at a location
get data: reads the data from a node
set data: writes data to a node
get children: retrieves a list of children of a node
sync: waits for data to be propagated

For a more in-depth discussion on these, and how they can be used to implement higher level operations, please refer to [tbd]
Implementation
ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.
ZooKeeper Components

The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
Every ZooKeeper server services clients. Clients connect to exactly one server to submit requests. Read requests are serviced from the local replica of each server database. Requests that change the state of the service, write requests, are processed by an agreement protocol.
As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.
ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system is when the write is to be applied and transforms this into a transaction that captures this new state.
Uses
The programming interface to ZooKeeper is deliberately simple. With it, however, you can implement higher order operations, such as synchronization primitives, group membership, ownership, etc. Some distributed applications have used it to: [tbd: add uses from white paper and video presentation.] For more information, see [tbd]
Performance
ZooKeeper is designed to be highly performant. But is it? The results of the ZooKeeper development team at Yahoo! Research indicate that it is. (See ZooKeeper Throughput as the Read-Write Ratio Varies.) It is especially high performance in applications where reads outnumber writes, since writes involve synchronizing the state of all servers. (Reads outnumbering writes is typically the case for a coordination service.)
ZooKeeper Throughput as the Read-Write Ratio Varies

The figure ZooKeeper Throughput as the Read-Write Ratio Varies is a throughput graph of ZooKeeper release 3.2 running on servers with dual 2 GHz Xeon and two SATA 15K RPM drives. One drive was used as a dedicated ZooKeeper log device. The snapshots were written to the OS drive. Write requests were 1K writes and the reads were 1K reads. "Servers" indicates the size of the ZooKeeper ensemble, the number of servers that make up the service. Approximately 30 other servers were used to simulate the clients. The ZooKeeper ensemble was configured such that leaders do not allow connections from clients.
Benchmarks also indicate that it is reliable. Reliability in the Presence of Errors shows how a deployment responds to various failures. The events marked in the figure are the following:
Failure and recovery of a follower

Failure and recovery of a different follower

Failure of the leader

Failure and recovery of two followers

Failure of another leader

Reliability
To show the behavior of the system over time as failures are injected we ran a ZooKeeper service made up of 7 machines. We ran the same saturation benchmark as before, but this time we kept the write percentage at a constant 30%, which is a conservative ratio of our expected workloads.
Reliability in the Presence of Errors

There are a few important observations from this graph. First, if followers fail and recover quickly, then ZooKeeper is able to sustain a high throughput despite the failure. Second, and maybe more importantly, the leader election algorithm allows the system to recover fast enough to prevent throughput from dropping substantially. In our observations, ZooKeeper takes less than 200 ms to elect a new leader. Third, as followers recover, ZooKeeper is able to raise throughput again once they start processing requests.
The ZooKeeper Project
ZooKeeper has been successfully used in many industrial applications. It is used at Yahoo! as the coordination and failure recovery service for Yahoo! Message Broker, which is a highly scalable publish-subscribe system managing thousands of topics for replication and data delivery. It is used by the Fetching Service for Yahoo! crawler, where it also manages failure recovery. A number of Yahoo! advertising systems also use ZooKeeper to implement reliable services.
All users and developers are encouraged to join the community and contribute their expertise. See the ZooKeeper Project on Apache for more information.

Last edited: 2017.12.11 06:20:45
