Designing Data-Intensive Applications Summary: Chapter 5 - Replication
How to copy data between data systems and keep them in sync.

Introduction
So far, in Part 1 of the book, the discussion has mainly revolved around a single instance of the data system. Part 2, however, focuses on distributed data systems, their interactions, and their challenges, as we shall see in this and some later articles.
Distributed Systems
What are distributed systems again? [1]
Distributed data systems are networks of interconnected computers that work together to store and process data across multiple locations or nodes, enhancing scalability and fault tolerance.
Why do we need distributed systems?
Scalability, i.e., handling more load
Fault tolerance / high availability, e.g., in case one node goes down
Latency, i.e., to reduce response time
Replication
So, what is replication? [1]
Replication in data systems refers to the process of duplicating and maintaining identical copies of data across multiple storage locations or nodes to enhance data availability and fault tolerance.
Why do we need replication?
Geography proximity to the user
Availability in case of system or component failure
Scale-out for higher throughput
Replicating never-changing data is easy: do it once and read it forever. The real challenge is replicating constantly changing data (our day-to-day databases being one obvious example).
Replication Methods
There are three main approaches:
Single-leader
Multi-leader
Leaderless
Virtually all distributed data systems use one of these.

Leader-Based Replication
Leader-based data systems work this way: writes are processed only by the leader, and reads can be processed by either the follower(s) or the leader.
Well-known systems following this design: PostgreSQL, MySQL, Oracle, SQL Server, MongoDB, RethinkDB, Espresso, Kafka, RabbitMQ.
Most databases complete replication in under a second, but that's not guaranteed, and to make matters worse, nodes can fail anywhere in between.
That's why fully synchronous replication across all followers is impractical; usually one follower is synchronous and the others are asynchronous (a configuration also called semi-synchronous).
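To make the write path concrete, here is a minimal Python sketch of the semi-synchronous setup just described: the leader confirms a write only after its single synchronous follower acknowledges it, and ships the same change to the asynchronous followers without waiting. All class and method names are hypothetical, not any particular database's API.

```python
# Minimal sketch of semi-synchronous leader-based replication.
# Names are hypothetical; real systems ship log records over the network.

class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []

    def apply(self, record):
        self.log.append(record)
        return True  # acknowledgment

class Leader:
    def __init__(self, sync_follower, async_followers):
        self.log = []
        self.sync_follower = sync_follower
        self.async_followers = async_followers

    def write(self, record):
        self.log.append(record)
        # Wait for the single synchronous follower before confirming.
        if not self.sync_follower.apply(record):
            raise RuntimeError("synchronous follower unavailable")
        # Fire-and-forget to the asynchronous followers (sequential here;
        # in reality a background process with possible lag).
        for f in self.async_followers:
            f.apply(record)
        return "committed"

leader = Leader(Follower("sync"), [Follower("async1"), Follower("async2")])
print(leader.write({"key": "user:1", "value": "Alice"}))
```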
Practical Approach For Replication
Using snapshots, especially ones that don't take a lock on the database, you can add more followers without downtime: the new follower restores the snapshot and then requests from the leader all changes made since the snapshot was taken.
How to Handle Node Outages (followers)?
Using its local log of received changes, a recovering follower applies everything it already has and then asks the leader for the rest, i.e., all changes since the last transaction it processed (catch-up recovery).
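A minimal sketch of this catch-up process, assuming the replication log is a simple list of (sequence number, change) pairs; the names are hypothetical:

```python
# Sketch of follower catch-up recovery against a toy replication log.

leader_log = [(1, "INSERT a"), (2, "UPDATE a"), (3, "DELETE b"), (4, "INSERT c")]

class RecoveringFollower:
    def __init__(self, last_applied_seq):
        self.last_applied_seq = last_applied_seq  # recovered from local log

    def catch_up(self, log):
        # Request every change after the last sequence number we processed.
        missing = [entry for entry in log if entry[0] > self.last_applied_seq]
        for seq, change in missing:
            print(f"applying {seq}: {change}")
            self.last_applied_seq = seq

RecoveringFollower(last_applied_seq=2).catch_up(leader_log)
```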
How to Handle Node Outages (the leader)?
This one is a bit trickier: with a proper timeout, the system can detect that the leader is unresponsive, consider it dead, promote a follower to be the new leader (failover), and ensure the old leader won't process any further write requests.
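As a rough sketch of this detection-and-promotion logic (the timeout value, names, and the "most up-to-date follower wins" policy are illustrative assumptions, not any specific database's implementation):

```python
import time

# Sketch of failover: declare the leader dead after a heartbeat timeout,
# then promote the most up-to-date follower to minimize lost writes.
TIMEOUT_SECONDS = 30

def detect_and_failover(last_heartbeat, followers, now=None):
    """followers: dict of name -> last replicated sequence number."""
    now = now if now is not None else time.time()
    if now - last_heartbeat < TIMEOUT_SECONDS:
        return None  # leader still considered alive
    return max(followers, key=followers.get)  # promote freshest follower

followers = {"follower-a": 1041, "follower-b": 1037}
print(detect_and_failover(last_heartbeat=0, followers=followers, now=100))
# -> follower-a
```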
Challenges of Async Replication
With asynchronous replication, failover comes with some challenges:
Some of the old leader's persisted writes may not have propagated yet at the time of failover. Discarding those writes violates durability.
Split-brain can occur, where two nodes simultaneously believe they are the leader. A common remedy is to shut one of them down to avoid data inconsistency (also known as Shoot The Other Node In The Head, or STONITH for short).
The timeout for declaring the leader dead can be neither too short nor too long: too short a timeout causes unnecessary failovers under temporary load spikes, while too long a timeout delays recovery.
Implementations of Replication Log
Approaches for replicating data between the leader and its followers:
Statement-based: individual SQL statements are replicated from the source database to one or more destination databases to keep the data synchronized [1]. Even though it seems reasonable, it has downsides:
Non-deterministic functions (e.g., NOW() or RAND()) will behave differently on different nodes.
UPDATE statements with conditions need to execute in the same order as on the leader to produce the same outcome.
Some statements have side effects, e.g., triggers, which can behave differently on different machines because they may execute in a different order.
WAL shipping: both SSTables and B-trees maintain such a log, and the leader forwards it to the followers. The downside is that the WAL operates at the byte level, so it can become incompatible between different database versions.
Logical log replication: operates at the row level and gives the follower enough information to replicate the changes; it has the advantage of decoupling the replicated data from the underlying storage engine (see the sketch after this list).
Trigger-based replication: delegates replication to application code, using triggers or stored procedures that invoke custom replication code whenever data changes.
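To make the contrast with byte-level WAL shipping concrete, here is an illustrative sketch of what row-level logical log records might look like and how a follower could apply them. The exact wire format varies per database (e.g., MySQL's row-based binlog), so this layout is purely an assumption:

```python
# Illustrative shape of logical (row-based) log records and a toy applier.

logical_log = [
    {"op": "insert", "table": "users", "row": {"id": 7, "name": "Alice"}},
    {"op": "update", "table": "users", "key": {"id": 7},
     "new_values": {"name": "Alicia"}},
    {"op": "delete", "table": "users", "key": {"id": 7}},
]

def apply_to_follower(store, record):
    """Apply one logical record to a follower's in-memory view of a table."""
    table = store.setdefault(record["table"], {})
    if record["op"] == "insert":
        table[record["row"]["id"]] = record["row"]
    elif record["op"] == "update":
        table[record["key"]["id"]].update(record["new_values"])
    elif record["op"] == "delete":
        del table[record["key"]["id"]]

store = {}
for rec in logical_log:
    apply_to_follower(store, rec)
print(store)  # {'users': {}} -- the row was inserted, updated, then deleted
```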
Guarantees to Cope with Replication Lag
Read-your-own-writes (implementation below)
You need to know who modified what (bookkeeping) and serve such readers only from the leader.
Based on access levels in the application layer, users with write access to a row query the leader, and everyone else queries the follower(s).
The client can hold a monotonic timestamp so that any follower that hasn't caught up to it won't serve that client's queries (see the sketch after this list).
Geographically distributed systems make this kind of guarantee more challenging because of the coordination involved.
Cross-device read-your-own-writes needs central coordination, which is another challenge.
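Here is a minimal sketch of the timestamp-based option from the list above, with hypothetical names: the client remembers the timestamp of its last write, and a replica serves the read only if it has caught up at least that far.

```python
# Sketch of read-your-own-writes via a client-held monotonic timestamp.
# Real systems use log sequence numbers or similar; names are made up.

class Replica:
    def __init__(self, applied_up_to):
        self.applied_up_to = applied_up_to  # timestamp of latest applied write
        self.data = {}

    def read(self, key, min_timestamp):
        if self.applied_up_to < min_timestamp:
            return None  # too stale for this client; try another replica
        return self.data.get(key)

def read_own_writes(replicas, key, last_write_ts):
    for replica in replicas:
        result = replica.read(key, min_timestamp=last_write_ts)
        if result is not None or replica.applied_up_to >= last_write_ts:
            return result
    raise RuntimeError("no sufficiently fresh replica; fall back to the leader")

fresh, stale = Replica(applied_up_to=105), Replica(applied_up_to=90)
fresh.data["profile"] = "updated bio"
print(read_own_writes([stale, fresh], "profile", last_write_ts=100))
```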
Monotonic reads: a user who has seen a newer version of a record should never be served older/outdated ones. One way of achieving this is to send all reads of a given user to the same replica (see the sketch after this list), though if that replica fails, the user will need to be rerouted.
Consistent prefix reads: when there is a causal relationship between writes, replicas acting independently may violate that relationship;
one solution is to keep all related writes on the same replica, but that doesn't scale.
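As a sketch of the "same replica per user" routing mentioned under monotonic reads, one simple (assumed, not prescribed) scheme is to hash the user ID onto a fixed replica list:

```python
import hashlib

# Sketch of monotonic reads by pinning each user to one replica via hashing.
# Production systems must also handle replica failure and rerouting.

REPLICAS = ["replica-0", "replica-1", "replica-2"]

def replica_for_user(user_id, replicas=REPLICAS):
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

# The same user always lands on the same replica, so their reads
# can never go "backwards in time" across replicas.
print(replica_for_user("user-42"))
print(replica_for_user("user-42"))  # identical to the line above
```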
Multi-Leader Replication
Multi-leader replication is a data replication approach where multiple nodes, known as leaders, independently accept and replicate changes, and they share data with followers through a bidirectional data synchronization process, which can be either synchronous or asynchronous, enabling high availability and concurrent writes across different parts of the system. [1]
Multi-leader replication is challenging but may be desired for some reasons, e.g., geographical proximity to the user, high availability, etc.
Introducing multiple leaders means write conflicts become possible, so conflict resolution is required.
Conflict Resolution
One way to handle conflicts is to take locks so that no two users can modify the same record simultaneously, though this defeats the whole point of multi-leader, and we're better off with a single leader instead.
Convergent conflict resolution: all replicas must arrive at the same final value. Here are the available methods:
Give each write a unique ID; the write with the highest ID wins as the last write (see the sketch below; implies data loss).
Give each replica a unique ID; writes from the replica with the higher ID win (data loss again).
Merge the conflicting values, e.g., concatenate them, and store the result.
Record the conflict and ask the user to resolve it later (you may have seen this during git merge).
Custom conflict resolution: write scripts that the database executes on read or on write when a conflict occurs.
One crucial note is that conflict resolution happens at the level of an individual record, not a whole transaction!
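A minimal sketch of the first method above, last write wins (LWW): each write carries a unique, comparable ID (here an assumed (timestamp, replica ID) pair), and the highest ID wins per record. Note how the losing value is silently discarded.

```python
# Sketch of last-write-wins (LWW) convergent resolution: the write with
# the highest unique ID wins for each record; concurrent losers are dropped.

def resolve_lww(versions):
    """versions: list of (write_id, value) pairs for one record."""
    return max(versions, key=lambda v: v[0])[1]

conflicting = [
    ((1700000000, "replica-a"), "title: Draft v1"),
    ((1700000007, "replica-b"), "title: Draft v2"),
]
print(resolve_lww(conflicting))  # "title: Draft v2"; v1 is lost forever
```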
Automatic Conflict Resolution
Here are the available algorithms:
conflict-free replicated data types (CRDTs) (sketched below)
mergeable persistent data structures
operational transformation
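To give a taste of CRDTs, here is a sketch of one of the simplest: a grow-only counter (G-counter). Each replica increments only its own slot, and merging takes the element-wise maximum, so replicas converge regardless of the order in which they exchange state. This is the textbook construction, not any particular library's implementation.

```python
# Sketch of a grow-only counter (G-counter) CRDT.

class GCounter:
    def __init__(self, replica_id, replica_ids):
        self.replica_id = replica_id
        self.counts = {rid: 0 for rid in replica_ids}

    def increment(self):
        self.counts[self.replica_id] += 1  # only touch our own slot

    def merge(self, other):
        # Element-wise max: commutative, associative, idempotent.
        for rid in self.counts:
            self.counts[rid] = max(self.counts[rid], other.counts[rid])

    def value(self):
        return sum(self.counts.values())

ids = ["a", "b"]
a, b = GCounter("a", ids), GCounter("b", ids)
a.increment(); a.increment()   # two increments on replica a
b.increment()                  # one concurrent increment on replica b
a.merge(b); b.merge(a)
print(a.value(), b.value())    # 3 3 -- converged without coordination
```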
Multi-leader replication is no piece of cake, and users are responsible for reading the docs of their respective data store and testing their workload against the database's promises to avoid any surprises.
Leaderless Replication
Finally, there are databases with no leader, in which the client is responsible for ensuring that a quorum of replicas gives the same answer on both writes and reads.
Quorum
Quorum reads and writes: if we have N replicas, W replicas must acknowledge a write for it to be accepted as persisted, and R replicas must respond to each read, such that W + R > N (a strict quorum).
Sloppy quorums allow a relaxed version where W + R <= N, which may result in reading stale values but offers lower latency and higher availability.
With a strict quorum, we can expect every read to return the most up-to-date write, because at least one of the R replicas read must overlap with one of the W replicas that acknowledged the write.
Having at least one overlapping replica with the most up-to-date value, the client dismisses the stale, outdated versions and writes the most up-to-date version back to the stale replicas (read repair).
To keep replicas consistent, we can either repair on read (read repair, as above) or have a background process do it (anti-entropy).
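Putting the quorum condition and read repair together, here is a minimal sketch with N = 3, W = 2, R = 2 (the data layout and names are assumptions for illustration):

```python
# Sketch of a strict-quorum read with read repair (N=3, W=2, R=2).
# Each replica stores {"version": ..., "value": ...}.

N, W, R = 3, 2, 2
assert W + R > N  # strict quorum condition

replicas = [
    {"version": 2, "value": "new"},   # up to date
    {"version": 2, "value": "new"},   # up to date
    {"version": 1, "value": "old"},   # lagging
]

def quorum_read(contacted):
    # W + R > N guarantees at least one contacted replica has the newest write.
    newest = max(contacted, key=lambda resp: resp["version"])
    # Read repair: push the newest version back to stale replicas we contacted.
    for replica in contacted:
        if replica["version"] < newest["version"]:
            replica.update(dict(newest))
    return newest["value"]

# Contact R replicas; one happens to be the lagging one.
print(quorum_read([replicas[0], replicas[2]]))  # "new"
print(replicas[2])  # repaired: {'version': 2, 'value': 'new'}
```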
Eventual Consistency and Hinted Handoff
Eventual consistency is a deliberately vague guarantee, but for operability, it's vital to be able to measure "eventually".
A sloppy quorum results in some nodes having stale values if, e.g., a network partition occurs.
Later, when those nodes rejoin the cluster, the nodes holding the newer writes send them over, a mechanism known as hinted handoff.
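A rough sketch of hinted handoff under those assumptions (the node and function names are hypothetical): while a replica is down, a healthy peer stores a hint, and replays it when the replica rejoins.

```python
# Rough sketch of hinted handoff: writes destined for a down replica are
# kept as "hints" by a healthy peer and replayed once the replica rejoins.

hints = []  # (destination_node, write) pairs held by a healthy peer

def write(node_states, key, value):
    for node, state in node_states.items():
        if state["up"]:
            state["data"][key] = value
        else:
            hints.append((node, (key, value)))  # hand off later

def on_rejoin(node, node_states):
    node_states[node]["up"] = True
    for dest, (key, value) in [h for h in hints if h[0] == node]:
        node_states[dest]["data"][key] = value  # replay the hinted write
        hints.remove((dest, (key, value)))

nodes = {"n1": {"up": True, "data": {}}, "n2": {"up": False, "data": {}}}
write(nodes, "k", "v")        # n2 is down; a hint is stored for it
on_rejoin("n2", nodes)
print(nodes["n2"]["data"])    # {'k': 'v'} -- delivered via hinted handoff
```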
Leaderless Conflict Resolution
How do we resolve conflicts in a leaderless database system?
Imagine having only one replica. Every client write should carry the version number of the last record version the client has seen; this version number can be fetched upon the client's first read from the database.
This way, every write request carries that version number, the database can track the causal relation between writes, and it won't process write requests based on a lower version number than the one it has already processed.
It adds some effort on the client side, but it won't lose data the way Last Write Wins (LWW) does.
The same algorithm can be generalized to a multi-replica database, except this time every replica responds with its replica number alongside its version number (together forming a version vector).
This way, we still have a durability guarantee, and conflicts can be resolved using methods like CRDTs.
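Here is a minimal sketch of the single-replica version-number scheme just described (class and method names are hypothetical): a write based on an older version is refused and the conflict surfaced, instead of silently overwriting data the way LWW would.

```python
# Sketch of version-number-based causality tracking on a single replica.

class VersionedStore:
    def __init__(self):
        self.version = 0
        self.value = None

    def read(self):
        return self.version, self.value

    def write(self, value, based_on_version):
        if based_on_version < self.version:
            # The client hasn't seen the latest value: surface the conflict
            # rather than losing data.
            return False, (self.version, self.value)
        self.version += 1
        self.value = value
        return True, (self.version, self.value)

store = VersionedStore()
v, _ = store.read()
print(store.write("milk", based_on_version=v))  # (True, (1, 'milk'))
print(store.write("eggs", based_on_version=v))  # stale: conflict returned
```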

Conclusion
In summary, distributed systems enhance availability and scalability, but they are no piece of cake, and there are a lot of challenges involved in operating such a system.
Therefore, reading the documentation and consulting with professionals is recommended to build or maintain your system.
Bonus
After reading this chapter of the book, I got curious about what modern database systems are doing about replication in their architectures. I consulted my good old friends Google Bard and ChatGPT.
Here’s the gist of it, but feel free to get the complete reference [2][3].
Nearly all well-known databases support async and sync replication, configurable by the administrator.
MySQL, PostgreSQL, and MongoDB operate in single-leader mode, Couchbase and CockroachDB are multi-leader, and Cassandra works in a leaderless way.
MySQL, MongoDB, Cassandra, Oracle, and SQL Server replicate in logical mode, while PostgreSQL uses WAL shipping.
Modern databases usually have poor support for conflict resolution and may need manual intervention, but future work may include automatic conflict resolution using the algorithms mentioned earlier or a new approach.
CouchbaseDB uses CRDTs, Cassandra uses LWW, and CockroachDB uses the Raft algorithm to achieve consensus (I really like Raft, and I plan to write my own Rust implementation later for the joy of the learning experience).
External tools are available for leader failover and election for MySQL and PostgreSQL, but MongoDB and Cassandra have built-in support.
Again, PostgreSQL and MySQL require external tools to avoid split-brain, but MongoDB and Cassandra have that as a first-class citizen.
For consensus, Cassandra uses consensus within nodes, CockroachDB uses the Raft algorithm, Riak uses vector clocks, and CouchbaseDB uses CRDT.
None of the modern-day database systems has built-in support for shoot-the-other-node-in-the-head (often used to avoid split-brain in multi-leader and leaderless systems).
References
[1] https://chat.openai.com/share/3a9b6e5c-21ba-45bc-a88e-c3a56e05f64b
[2] https://g.co/bard/share/1a0e7aeef131
[3] https://chat.openai.com/share/914b9a55-8429-4999-a7b5-009dedc3b5b9
TL;DR
In Chapter 5 of "Designing Data-Intensive Applications," the focus shifts from single-instance data systems to distributed data systems. Distributed data systems are networks of interconnected computers that collaborate to store and process data across multiple locations or nodes, enhancing scalability and fault tolerance.
Replication plays a crucial role in achieving these goals by maintaining identical copies of data across multiple nodes. Three standard replication approaches are single-leader, multi-leader, and leaderless, each with advantages and challenges.
Ensuring consistency in distributed systems involves techniques like read-your-own-writes, monotonic reads, consistent prefix reads, and various conflict resolution strategies.
Finally, leaderless databases require clients to manage version numbers to ensure causal relationships between writes, enabling conflict resolution without data loss.


