Designing Data-Intensive Applications Summary: Chapter 9 - Consensus
Is it really possible to achieve consensus, given all its challenges!?

Introduction
This chapter takes the knowledge gathered in the book's earlier chapters and applies it to data systems distributed across more than one node, nodes that may be geographically far apart and connected only over the network.
It covers some of the known techniques for dealing with such a system, as well as the challenges of maintaining one.
Given all these challenges, if it's possible to avoid a distributed system and keep everything on one machine, it's worth doing so!
Linearizability
The best way of building fault-tolerant systems is to find some general-purpose abstraction with useful guarantees and let the applications rely on those guarantees.
A better name for eventual consistency may be convergence.
Linearizability definition: the database gives the illusion that there is only one replica of the data, i.e., a single copy, even if behind the scenes it is replicated or partitioned for high availability.
Another (more precise) definition of linearizability: as soon as any read has returned the new value, all subsequent reads, on the same or other clients, must also return the new value.
It is possible to check for a system's linearizability, though computationally expensive.
Serializability vs Linearizability
The former is an isolation property of transactions running concurrently, while the latter is a recency guarantee on reads and writes of a register (an individual object).
A database that provides both guarantees is said to offer strict serializability.
Two-phase locking (2PL) and serial execution are linearizable. Serializable Snapshot Isolation, however, is not.
Linearizability in Action
Two examples where linearizability is useful:
Leader election
Uniqueness guarantee (e.g., picking a globally unique username)
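To make the uniqueness example concrete, here is a minimal single-process sketch in Python of claiming a username through a linearizable compare-and-set register. The class and helper names are illustrative, not from the book; a lock simulates on one machine the atomicity that a real linearizable store would have to provide across replicas.

    import threading

    class LinearizableRegister:
        """A register exposing atomic compare-and-set (simulated with a lock)."""
        def __init__(self):
            self._lock = threading.Lock()
            self._values = {}

        def compare_and_set(self, key, expected, new):
            # Atomically: write `new` only if the current value equals `expected`.
            with self._lock:
                if self._values.get(key) == expected:
                    self._values[key] = new
                    return True
                return False

    def claim_username(register, username, user_id):
        # The username is free only if nobody has claimed it yet (value None).
        return register.compare_and_set(username, None, user_id)

    reg = LinearizableRegister()
    print(claim_username(reg, "alice", "user-1"))  # True: first claim wins
    print(claim_username(reg, "alice", "user-2"))  # False: already taken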
Linearizability is slow, and that's why most DBs don't provide that guarantee.
Even RAM on a modern multi-core machine doesn't give the CPU that guarantee (linearizability) by default, for the obvious reason that it would be a performance bottleneck.
It has also been formally proven that you can't skip the performance overhead of linearizability while keeping its recency guarantee; the reference to the academic paper can be found in the book.
Ordering
Why is ordering vital in the context of distributed systems? Because it helps preserve causality, i.e., the fact that one event happened after another.
What is causality? That an event happened after another event and may depend on (know about) it. Consider an application that computes an accounting aggregation over one month: it needs a durable, consistent view of the data throughout its computation.
A system that obeys the ordering imposed by causality is causally consistent.
Total order: a concept rooted in mathematics; any two elements can be compared, e.g., any two natural numbers (1, 2, 3, …) can be ordered relative to each other.
Partial order: some pairs of elements are incomparable, e.g., sets ordered by inclusion ({a, b} and {b, c} are neither subset nor superset of one another, so they can't be ordered).
Some inferences from the textbook:
Linearizability entails total ordering
Causality entails partial ordering
Linearizability implies causality
Causal consistency is the strongest possible consistency model that does not slow down because of network delays and remains available in the face of network failures.
Lamport Timestamp
TL;DR: a Lamport timestamp is a pair of (counter, node ID). Every node and every client keeps track of the maximum counter value it has seen so far and includes it in every request; when a node sees a counter greater than its own, it immediately bumps its own counter to that maximum.
In distributed systems, where multiple independent computers communicate and coordinate their actions, a Lamport timestamp is a logical clock used to order events. Unlike physical clocks, which may not be synchronized across machines, Lamport timestamps provide a way to compare the relative order of events that occur on different nodes of the system.
Here are the critical characteristics of Lamport timestamps:
Monotonically increasing: Each timestamp is assigned a unique value that increases with each event.
Causality-preserving: If event A happens before event B, then the Lamport timestamp of A will be less than the Lamport timestamp of B.
Locally generated: Each process maintains its own local Lamport timestamp and increments it for each event that occurs at that process.
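As a rough illustration of the mechanism described above, here is a minimal Lamport clock sketch in Python; the class and method names are my own, not from the book.

    class LamportClock:
        """Each node keeps a counter; a timestamp is the pair (counter, node_id)."""
        def __init__(self, node_id):
            self.node_id = node_id
            self.counter = 0

        def local_event(self):
            # Increment the counter for every event generated on this node.
            self.counter += 1
            return (self.counter, self.node_id)

        def on_receive(self, remote_counter):
            # On seeing a larger counter, jump forward to it, then increment.
            self.counter = max(self.counter, remote_counter) + 1
            return (self.counter, self.node_id)

    # Timestamps compare as plain tuples: first by counter, then by node ID
    # to break ties, which yields a total order across all nodes.
    a, b = LamportClock("A"), LamportClock("B")
    t1 = a.local_event()        # (1, 'A')
    t2 = b.on_receive(t1[0])    # (2, 'B'), causally after t1
    assert t1 < t2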
Lamport timestamps are used to solve various problems in distributed systems, including:
Mutual exclusion: Coordinating access to shared resources to avoid conflicts.
Consensus: Reaching agreement among processes on a common value.
Total ordering: Ordering events in a single, consistent order across the entire system.
However, Lamport timestamps also have limitations:
Concurrency is invisible: given two Lamport timestamps, you cannot tell whether the events were causally related or merely concurrent; concurrent events in different processes end up in an arbitrary (though consistent) order.
Overhead: Maintaining and synchronizing Lamport timestamps can introduce overhead in distributed systems.
Here are some alternative approaches to ordering events in distributed systems:
Vector clocks: carry more information than Lamport timestamps and make it possible to detect whether two events are concurrent.
Happened-before relationships: Explicitly define the relationships between events to determine their order.
Lamport timestamps give a total order that is consistent with causality.
A version vector, by contrast, can distinguish whether two operations are concurrent or causally dependent.
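To make that distinction concrete, here is a rough sketch of how a version vector can classify two operations as ordered or concurrent, assuming (for illustration only) that each vector is a dict mapping node IDs to counters.

    def compare(vv_a, vv_b):
        """Compare two version vectors (dicts mapping node ID -> counter)."""
        nodes = set(vv_a) | set(vv_b)
        a_behind = any(vv_a.get(n, 0) < vv_b.get(n, 0) for n in nodes)
        b_behind = any(vv_b.get(n, 0) < vv_a.get(n, 0) for n in nodes)
        if a_behind and b_behind:
            return "concurrent"           # neither happened before the other
        if a_behind:
            return "a happened before b"
        if b_behind:
            return "b happened before a"
        return "equal"

    print(compare({"n1": 2, "n2": 1}, {"n1": 2, "n2": 3}))  # a happened before b
    print(compare({"n1": 3, "n2": 1}, {"n1": 2, "n2": 3}))  # concurrent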
The total ordering given by Lamport timestamps is still not sufficient to solve problems such as ensuring the uniqueness of newly created usernames: the order only emerges after the fact, so a node cannot decide on its own, right now, whether its operation will win.
Therefore, it's not enough to have a total ordering of operations; you also need to know when that ordering is finalized.
Total Order Broadcast
Total order broadcast is a message-passing protocol for distributed systems that requires two safety properties: reliable delivery (if a message is delivered to one node, it is delivered to all nodes) and totally ordered delivery (messages are delivered to every node in the same order).
Total order broadcast gives an ordering that is consistent with causality.
ZooKeeper and etcd implement total order broadcast.
There is a strong connection between total order broadcast and consensus.
Consensus is a big theme of this chapter of the book.
State machine replication: all writes are propagated to every node in the same order; if every replica processes them deterministically, the replicas stay consistent with each other.
Total ordering can be used to implement serializable transactions.
Total order broadcast is asynchronous; messages are guaranteed to be delivered in a fixed order, but there is no guarantee on when they will be delivered.
Guarantees of Total Order Broadcast
You can build linearizable storage on top of total order broadcast.
Let’s see how total order broadcast implements its guarantees:
Reliable delivery to all nodes is achieved by retransmitting lost messages, e.g., over TCP.
To keep making progress despite unbounded network delays and node failures, it relies on consensus algorithms such as Paxos or Raft.
To avoid serving stale reads, quorum reads can be employed (or the read itself can be sequenced through the log).
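As a concrete illustration of building linearizable storage on top of the log, here is a rough Python sketch of the username-uniqueness idea: append a claim message, replay the log in the single agreed order, and win only if your claim is the first one for that name. The log interface and names are illustrative assumptions, not a real library, and the in-memory log stands in for a real total order broadcast implementation.

    class InMemoryLog:
        """Single-process stand-in for a total order broadcast log."""
        def __init__(self):
            self.messages = []
        def append(self, msg):
            self.messages.append(msg)   # in reality: broadcast via consensus
        def read_delivered(self):
            return list(self.messages)  # every node sees this same order

    def claim_username(log, username, my_claim_id):
        # Append a tentative claim, then replay the log in the agreed order;
        # the first claim for this username in the log wins.
        log.append({"type": "claim", "name": username, "id": my_claim_id})
        for msg in log.read_delivered():
            if msg["type"] == "claim" and msg["name"] == username:
                return msg["id"] == my_claim_id
        return False

    log = InMemoryLog()
    print(claim_username(log, "alice", "claim-1"))  # True: first claim in the log
    print(claim_username(log, "alice", "claim-2"))  # False: an earlier claim exists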
It has been proven (the FLP result) that consensus cannot be reliably achieved if there is any risk of a node crashing. However, that proof assumes an asynchronous model with no clocks or timeouts; consensus does become solvable once timeouts, or some other way of suspecting crashed nodes, are available.

Two-Phase Commit
Two-phase commit (2PC) is a consensus algorithm, though not a very good one, as we will see.
In a distributed system, when a transaction modifies objects on multiple nodes, it is not enough for each node to commit independently: some nodes may commit while others abort, violating atomicity and, as a result, the consistency of the data.
A transaction commit must be irrevocable, i.e., you can't change your mind and roll back once you have committed.
The commit point of 2PC comes down to a decision on a single node: the coordinator decides the fate of the overall transaction across all participants.
2PC provides a strict safety guarantee (atomic commit across nodes) but has a lot of operational overhead; in particular, if the coordinator crashes after participants have prepared, they are left in doubt and blocked until it recovers.
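For illustration, here is a highly simplified sketch of the coordinator's two phases in Python; the participant interface is invented for this example, and a real implementation would have to write the decision to durable storage before notifying anyone and keep retrying phase two until every participant acknowledges it.

    class Participant:
        def __init__(self, name, can_commit=True):
            self.name, self.can_commit = name, can_commit
        def prepare(self, txn):
            return self.can_commit      # vote yes only if able to commit txn
        def commit(self, txn):
            print(f"{self.name}: committed")
        def abort(self, txn):
            print(f"{self.name}: aborted")

    def two_phase_commit(participants, transaction):
        # Phase 1 (prepare/voting): each participant durably stages the writes
        # and promises it will be able to commit, but does not commit yet.
        votes = [p.prepare(transaction) for p in participants]

        # The commit point: the coordinator alone decides the fate of the
        # transaction (and must persist the decision before telling anyone).
        decision = "commit" if all(votes) else "abort"

        # Phase 2: the decision is irrevocable; retry until all acknowledge.
        for p in participants:
            if decision == "commit":
                p.commit(transaction)
            else:
                p.abort(transaction)
        return decision

    print(two_phase_commit([Participant("db1"), Participant("db2")], {"op": "write"}))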
Consensus
Consensus has four formal properties:
Uniform agreement: no two nodes decide differently
Integrity: no node decides twice, i.e., a node cannot change its mind once it has decided
Validity: if a node decides a value, that value was proposed by some node
Termination: every node that does not crash eventually decides some value and does not loop forever
The first three are safety properties, and the last one is a liveness property.
Most of today’s consensus implementations ensure that the safety properties are always met, even if a majority of nodes crash or there is a severe network problem; only progress (termination) is sacrificed in those situations.
Also, most consensus algorithms assume that there are no Byzantine faults, which is a reasonable assumption as long as all nodes are run by the same organization and the administrator keeps the network private.
The best-known fault-tolerant consensus algorithms are:
Viewstamped Replication (VSR)
Paxos
Raft
Zab
Most of these algorithms don’t implement the formal model directly (proposing and deciding on a single value); instead, they decide on a sequence of values, which makes them total order broadcast algorithms.
Consensus, a Price Too Big to Pay!?
To elect a new leader, we first need a leader!
To solve consensus, we must first solve consensus!
Therefore, consensus protocols use a monotonically increasing epoch number (a ballot, view, or term number, depending on the algorithm) and guarantee that within each epoch the leader is unique.
If two leaders from different epochs conflict, the one with the higher epoch number wins.
To sum it up, there are two rounds of voting: once to choose a leader, and a second time to vote on that leader’s proposal (e.g., a database write).
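Here is a toy Python sketch of those two ideas, the epoch rule and quorum voting; it is a deliberately simplified model for illustration, not Raft or Paxos.

    class Acceptor:
        """Toy model of a node voting on proposals (not any specific algorithm)."""
        def __init__(self):
            self.highest_epoch_seen = 0

        def accept(self, epoch, value):
            # Reject anything from a stale leader: the higher epoch always wins.
            if epoch < self.highest_epoch_seen:
                return False
            self.highest_epoch_seen = epoch
            return True

    def propose(acceptors, epoch, value):
        # Second round of voting: the proposal is decided only with a quorum.
        votes = sum(a.accept(epoch, value) for a in acceptors)
        return votes > len(acceptors) // 2

    nodes = [Acceptor() for _ in range(3)]
    print(propose(nodes, epoch=2, value="write-x"))  # True: a quorum accepts
    print(propose(nodes, epoch=1, value="write-y"))  # False: stale epoch rejected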
Consensus algorithms are a significant breakthrough: they provide fault tolerance, total order broadcast, and a way to implement linearizable atomic operations. Still, they come at a performance cost, which is why many systems settle for a leader with asynchronous replication instead, accepting the risk of losing committed data on failover for the sake of performance.
A non-negotiable requirement of consensus systems is that a majority of the nodes must be up and reachable in order to make progress.
Frequent leader elections (for example, caused by overly sensitive failure-detection timeouts on an unreliable network) result in terrible performance, because the system ends up spending more time choosing a leader than doing any useful work.

Conclusion
A wide range of problems in distributed systems can be reduced to consensus, i.e., a solution to one of them can be transformed into a solution to the others.
If you find yourself in need of consensus, instead of writing your own algorithm, it is advisable to use a service like ZooKeeper.


