Distributed systems for fun and profit

一般来说，我们利用分布式系统是为了：

Storage：扩展存储能力
Computation：扩展计算能力

分布式系统想要达成的目标

Scalability

“Is the ability of a system,network, or process, to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth。”

Size scalability：增加nodes，可以使系统线性增长，增加dataset不会增加latency
Geographic scalability：地理位置扩展
Administrative scalability：增加nodes不应该增加administrative costs

Performance(and latency)

“Is characterized by the amount of useful work accomplished by a computer system compared to the time and resources used.”

Low latency：延迟
High throughout：吞吐量
Low utilization of computing resource: 资源利用率

Availability

“The proportion of time a system is in a functioning condition. If a user cannot access the system, it is said to be unavailable.”

Fault tolerance

“Ability of a system to behave in a well-defined manner once faults occur”

分布式系统中的模型

System model (asynchronous / synchronous)
Failure model: (crash-fail, partitions, Byzantine)
Consistency model:(strong, weak, eventual)
理想状态下，我们想让分布式系统“表现得好像一个单系统”。但事与愿违，不同种类的failures让这个目标实现起来很困难。比如当出现partition（分区）的时候，你是要为了availability接受用户请求呢，还是为了safety拒绝用户。关于权衡分布式系统各个方面的表现，最著名的就是CAP理论。

分布式系统常用技术

Partition and Replicate

Partition：把数据且分开，便于并行处理。限制数据块大小，提高处理效率，数据分块，提高可用性。
Replication：把数据拷贝多份便于availability、performance、fault tolerance，防止丢数据，缓解慢I/O和慢查询，但是会带来一致性问题，强一致和弱一致。

Up and down the level of abstraction

System model
System model 是为了抽象地描述一个分布式系统的特征。它一组assumptions，包括了各种特性
Nodes in the system model
Nodes 是用来执行程序，储存数据，提供时钟的。它可以执行一系列的指令。
Communication link between nodes
用来描述信道，最多常用的设定是，一个系统中的不同节点之间有不同的信道，每个信道是FIFO的，每条接受到的消息必须是发送过的，消息不能丢失。
Timing / Ordering assumptions
Synchronous：一条消息有确认的延迟上限，Asynchronous：一条消息没有确定的延迟上限。
The consensus problem
Consensus 问题就是所有的节点对某个值打成了共识。Agreement: 所有正常的节点都同意，Integrity：所有的正常节点只能对最多一个值打成一致，Termination：最终总会达成一致，Validity：最终达成一致的值必须是V1到Vn其中一个, 如果所有初始值都是vx, 那么最终结果也必须是vx.
FLP and CAP
FLP:在异步通信场景，即使只有一个进程失败，也没有任何算法能保证非失败进程达到一致性！
CAP:Consistency, availability, partition tolerance不能同时达到
Consistency model
Strong consistency models (capable of maintaining a single copy)
1.Linearizable consistency: 每个节点按原有的时间顺序执行指令，2.Sequential consistency：每个节点按同样的时间顺序执行指令

Weak consistency models(not strong)
1.Client-centric consistency，2.Causal consistency，3.Eventual consistency

Time and order

本章讲了一个分布式系统中的时间和顺序。对于order，定义的是先后顺序。而对于time来说，除了先后顺序以外，还包含了interpretation（如何解释时间），duration（定义事件间隔）

Order

Partial order 与 total order，也就是偏序和全序的关系。全序就是任何两个元素能比较大小，而偏序不一定，比如A包含B，则说A大于B，但存在集合互不包含，两个集合就无法比较大小。

Time

Global clock：对应模型Synchronous，支持Total Order，分布式中所有节点都共享一个时钟，任意两个操作可以被赋予顺序。
Local clock：对应模型Partially synchronous，一种Partial order. 本地有序，跨节点之间的事件是无序的。不能比较两个节点上的时间戳
No clock：对应模型Asynchronous：另一种Partial order. 本地有序，远程需要交互才能确定顺序。提供了偏序，可以确定本地顺序，跨节点需要由message change来确定

分布式系统中order和time的作用

order：确定操作顺序可以确保正确性，可以解决争夺资源时候的顺序问题，global clock可以确定整个系统的order，没有global clock，需要通过communication来确定
time：可以用在failure detector，确定high latency 或者 link is down

系统中逻辑时钟

Lamport clock
Vector clock
Failure detector：failure detector用 “heartbeat messages” 和 “timers” 来实现。即不断发送“心跳信息”之后计算时间。而衡量Failure detector，有两个重要特性: completeness 和 accuracy.

Replication：preventing divergence

synchronous replication、asynchronous replication

Replication methods that prevent divergence (single copy systems)
Replication methods that risk divergence (multi-master systems)
Single copy system让整个系统“表现的像是一个节点”。每当部分节点宕机，系统可以保证只有一个active的值，还有，可以保证这个值被大家接受（其实就是consensus问题）

Primary/Backup

Single, static master
Replicated log, slaves are not involved in executing operations
No bounds on replication delay
Not partition tolerant
Manual/ad-hoc failover, not fault tolerant, “hot backup”

2PC

Unanimous vote: commit or abort
Static master
2PC cannot survive simultaneous failure of the coordinator and a node during a commit
Not partition tolerant, tail latency sensitive

Paxos

Majority vote
Dynamic master
Robust to n/2-1 simultaneous failures as part of protocol
Less sensitive to tail latency

参考资料

分布式之【CAP理论、BASE理论、FLP不可能定理】