袖珍分布式系统（二）

Paste_Image.png

本文是Distributed systems for fun and profit的第二部分，本文是阅读该文后的一些记录。

分布式编程大多数数时间都是在处理分布式后带来的影响。为什么这么说呢？因为虽然理想情况是：我们在分布式系统上编程跟在单机上编程一样，这种抽象对于程序员来说是最友好的，但是呢？理想很丰满，现实很骨感，我们必须拨开抽象，去处理影藏在单机抽象背后的多机系统带来的问题，才可能很好的解决问题。因此，我们现在不断在寻求一个更好的抽象模型，尽可能的让编程在分布式环境下变的简单。

那问题来了，怎么定义一个抽象更好呢？

What do we mean when say X is more abstract than Y? First, that X does not introduce anything new or fundamentally diﬀerent from Y. In fact, X may remove some aspects of Y or present them in a way that makes them more manageable. Second, that X is in some sense easier to grasp than Y, assuming that the things that X removed from Y are not important to the matter at hand.

总结起来就是：用尽可能少的假设来描述清楚一个东西。

抽象非常重要，能够帮我们抓住主要问题，如果我们能抓住本质，那此时想出来的解决方案已定也是最通用的。

那问题有来了，我们怎么知道哪些东西是主要的，或者说是本质的？

我们每次在实际问题上，剥离掉系统一个具体的限制的时候（假设这个不是主要问题），都可能给系统埋下隐患，这也是为什么后期系统必须要重新将限制引入系统，因为这些去掉的限制，才是影响到系统工作的真正因素。

有了这个概念，接着就可以介绍在保持下系统工作的同时，最少的假设条件是多少，这种假设条件演化出来的就是我们经常说的系统模型。

A system model

分布式系统最大的属性就是：分布式，更具体来说，一个分布式系统中的程序具有的属性有：

run concurrently on independent nodes …【独立节点上并发执行】
are connected by a network that may introduce nondeterminism and message loss …【通过网络互连】
and have no shared memory or shared clock.【不共享内存和时钟】

具体解释是：

each node executes a program concurrently【每个节点都并发执行】
knowledge is local: nodes have fast access only to their local state, and any information about global state is potentially out of date 【每个节点只知道自己节点上的信息】
nodes can fail and recover from failure independently 【每个节点失败和恢复都是独立的】
messages can be delayed or lost (independent of node failure; it is not easy to distinguish network failure and node failure) 【通信是不可靠的】
and clocks are not synchronized across nodes (local timestamps do not correspond to the global real time order, which cannot be easily observed)【时钟不同步】

那什么是系统模型？

System model：a set of assumptions about the environment and facilities on which a distributed system is implemented

系统模型定义了关于 environment and facilities 的假设，这些假设包括：

what capabilities the nodes have and how they may fail 【每个节点能力和失败方式】
how communication links operate and how they may fail and 【节点间通信方式和失败方式】
properties of the overall system, such as assumptions about time and order【整个系统属性：如时序】

什么是抗造系统，就是系统模型做了最少的假设，由于假设少，基于这种系统设计的算法，都是容错性非常好的，但是同样的，由于假设少，所以算法也很难理解。

下面是具体介绍：properties of nodes, links and time and order

Nodes in our system model

节点是系统有计算和存储能力的hosts，它们能：

the ability to execute a program 【计算】
the ability to store data into volatile memory (which can be lost upon failure) and into stable state (which can be read after a failure) 【存储】
a clock (which may or may not be assumed to be accurate)【时钟】

此处我们假设node的failure models是 crash-recovery failure model，而考虑 Byzantine fault tolerance，即拜占庭错误

Communication links in our system model

分布式系统中最难处理的假设就是通信假设，我们在分布式系统中，一个系统很难知道另一个系统的情况，因为任何的通信都是不可靠的，信息都无法交流，还怎么知道别人的情况，因此分布式系统中，能依赖的只有节点本身的信息。

Timing / ordering assumptions

在分布式系统中我们必须认识到：每个node看到的世界都是不同的，这个不同来自于一个事实：信息的传输需要时间。对于同一件事情，每个节点看到这个事情的时间都是不一样的，因此每个节点看到的世界，其时间点都是不同的。

有两个主要的关于时间的模型:

Synchronous system

model Processes execute in lock-step; there is a known upper bound on message transmission delay; each process has an accurate clock

Asynchronous system

model No timing assumptions - e.g. processes execute at independent rates; there is no bound on message transmission delay; useful clocks do not exist

The consensus problem

下面对网络是否分区包含在错误模型中和网络传输是同步还是异步模型两个条件的讨论

whether or not network partitions are included in the failure model, and
synchronous vs. asynchronous timing assumptions

先介绍下什么是一致性模型

Agreement: Every correct process must agree on the same value.
Integrity: Every correct process decides at most one value, and if it decides some value, then it must have been proposed by some process.
Termination: All processes eventually reach a decision.
Validity: If all correct processes propose the same value V, then all correct processes decide V.

Two impossibility results

什么是impossibility results

A proof of impossibility, also known as negative proof, proof of an impossibility theorem, or negative result, is a proof demonstrating that a particular problem cannot be solved, or cannot be solved in general. Often proofs of impossibility have put to rest decades or centuries of work attempting to find a solution. To prove that something is impossible is usually much harder than the opposite task; it is necessary to develop a theory. Impossibility theorems are usually expressible as universal propositions in logic (see universal quantification).

说白点就是证明了某项事情是不可能的，这样子人们就不用去朝着这个东西再去白费力气了。

在分布式中有两个重要的impossibility results，FLP 和 CAP，FLP的重要性在于学术研究，此处不做过多介绍，下面主要介绍下CAP

The CAP theorem

CAP中每个字母代表是：

Consistency: all nodes see the same data at the same time.（一致性）
Availability: node failures do not prevent survivors from continuing to operate.（可用性）
Partition tolerance: the system continues to operate despite message loss due to network and/or node failure（分区容忍性）

上面3个特性，只有2个能同时满足，因此我们会有3种系统：

CA (consistency + availability). Examples include full strict quorum protocols, such as two-phase commit.
CP (consistency + partition tolerance). Examples include majority quorum protocols in which minority partitions are unavailable such as Paxos.
AP (availability + partition tolerance). Examples include protocols using conﬂict resolution, such as Dynamo.

CA和CP系统都提供了强一致性模型，不同是CA不可以容忍网络分区，而CP在2f+1个节点中，可以容忍f个节点失败，原因很简单：

A CA system does not distinguish between node failures and network failures, and hence must stop accepting writes everywhere to avoid introducing divergence (multiple copies). It cannot tell whether a remote node is down, or whether just the network connection is down: so the only safe thing is to stop accepting writes.【不能区分网络分区和节点失败，因此必须停止写入避免引入不一致】
A CP system prevents divergence (e.g. maintains single-copy consistency) by forcing asymmetric behavior on the two sides of the partition. It only keeps the majority partition around, and requires the minority partition to become unavailable (e.g. stop accepting writes), which retains a degree of availability (the majority partition) and still ensures single-copy consistency.【即使网络分区了，大多数节点的一方还是能够提供服务】

CP系统因为将网络分区考虑到了failure model中，因此能够通过类似Paxos, Raft 的协议来区分a majority partition and a minority partition

CA则由于没有考虑网络分区的情况，因此无法知道一个节点不响应式因为节点收不到消息还是节点失败了，因此只能够通过停止服务来防止出现数据一致，在CA中由于不能保证网络可靠性，因此通过使用two-phase commit algorithm来保证数据一致性。

从CAP理论中我们能得到4个结论：

First, that many system designs used in early distributed relational database systems did not take into account partition tolerance (e.g. they were CA designs). Partition tolerance is an important property for modern systems, since network partitions become much more likely if the system is geographically distributed (as many large systems are).【早期系统大多没有考虑P，因此是CA系统，但是现代系统，特别是出现异地多主后，必须考虑分区了】
Second, that there is a tension between strong consistency and high availability during network partitions. The CAP theorem is an illustration of the tradeoﬀs that occur between strong guarantees and distributed computation.【P既然无法避免，我们只能在C和A之间做选择，有时候我们可以通过降低数据的一致性模型，不再追求强一致，从而达到"CAP"】
Third, that there is a tension between strong consistency and performance in normal operation.【当一个操作涉及的消息数和节点的数少的时候，延迟自然就低，但是这也意味着有些节点不会被经常访问，意味着数据会是旧数据】
Fourth - and somewhat indirectly - that if we do not want to give up availability during a network partition, then we need to explore whether consistency models other than strong consistency are workable for our purposes.【有时候3选2可能是误解，我们如果将自己不限制在强一致性模型，我们会有更多的选择】

我们要记住：

ACID consistency != CAP consistency != Oatmeal consistency

一致性模型的概念是：

Consistency model

a contract between programmer and system, wherein the system guarantees that if the programmer follows some speciﬁc rules, the results of operations on the data store will be predictable

一致性模型是编程者和系统之间的契约，只要编程者按照某种规则，那计算机的操作结果就是可预测的。

下面介绍一些一致性模型：

Strong consistency vs. other consistency models

Strong consistency models (capable of maintaining a single copy)
- Linearizable consistency
- Sequential consistency
Weak consistency models (not strong)
- Client-centric consistency models
- Causal consistency: strongest model available
- Eventual consistency models

一致性模型可以分为两大类：强一致和弱一致。

强一致模型给编程者提供的是一个和单机系统一样的模型，而弱一致，则让编程者清楚的意识要是在分布式环境下编程，而不是单机环境。

Strong consistency models

强一致性模型可以再细分为两大类：

Linearizable consistency: Under linearizable consistency, all operations appear to have executed atomically in an order that is consistent with the global real-time ordering of operations. (Herlihy & Wing, 1991)
Sequential consistency: Under sequential consistency, all operations appear to have executed atomically in some order that is consistent with the order seen at individual nodes and that is equal at all nodes. (Lamport, 1979)

两者的最大不同是：linearizable consistency要求操作的结果要和操作实际执行的顺序一致，而Sequential consistency则允许操作实际发生的顺序和操作产生结果的顺序不同，只要每个节点看到的顺序是一样的就行。两者之间的差别基本上可以忽略。

Client-centric consistency models

该一致性模型主要是为了解决下面的情况：客户端进行了某个操作，同时也看到了最新的结果，但是由于网络中断，重新连接到server，此时不能因为重新连接而看到一个旧的结果。

Eventual consistency

最终一致性我们需要知道两点：

最终一致，这个最终是多久？我们需要有个下限，或者至少是一个平均值
多个副本怎么达成一致？

First, how long is "eventually"? It would be useful to have a strict lower bound, or at least some idea of how long it typically takes for the system to converge to the same value.【最终，这个时间是多久】
Second, how do the replicas agree on a value? how非常重要，因为如果设计的不好，可能会导致数据丢失。

因此，在谈论最终一致的时候，我们需要知道这可能是："eventually last-writer-wins, and read-the-latest-observed-value in the meantime"