Designing Data-Intensive Applications Summary: Chapter 6 - Partitioning
How to split data between data stores to achieve scalability.

Introduction
This chapter goes hand in hand with the previous one, since data stores usually implement partitioning (also known as sharding) and replication together.
However, the choice of scheme for one is largely independent of the choice for the other.
That said, let’s jump into the notes for this chapter.
Why?
The main reason for partitioning data is scalability, i.e., to handle more load, more users, etc.
This chapter discusses the following topics:
Approaches to partitioning
Rebalancing (essential for adding/removing nodes)
Request routing between partitions in different nodes
What?
Partitioning and replication go hand-in-hand, i.e., even though one node is the leader of a partition, its data is replicated for fault tolerance.
The choice of partitioning scheme is mostly independent of the choice of replication scheme.
Skewing
Definition of skew: unfair partitioning where one node holds more data than others.
Skewing makes partitioning less effective and causes hotspots (My take: it’s the most challenging part of partitioning/sharding).
To distribute the data evenly among partitions, the partition boundaries need to adapt to the data: rather than fixed, evenly spaced boundaries for ranged keys, the ranges should follow the actual key distribution so that each node ends up with a similar amount of data and hotspots are avoided.
The partitioning boundaries can be set manually by an administrator using domain knowledge of the data, or automatically by the storage engine.
Each partition will keep the keys sorted to make range queries faster.
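As a rough sketch (the boundary keys and node names below are made up for illustration), key-range partitioning can be modeled as a sorted list of boundary keys plus a binary search to find the owning partition:

```python
import bisect

# Hypothetical boundaries for a store keyed by, say, last name:
# keys < "f" -> node-0, ["f", "m") -> node-1, ["m", "t") -> node-2, >= "t" -> node-3
boundaries = ["f", "m", "t"]
nodes = ["node-0", "node-1", "node-2", "node-3"]

def partition_for(key: str) -> str:
    """Find the node owning a key's range via binary search over the boundaries."""
    return nodes[bisect.bisect_right(boundaries, key)]

# Because keys inside each partition stay sorted, a range query only needs to
# touch the partitions whose ranges overlap the requested interval.
print(partition_for("alice"))    # node-0
print(partition_for("mallory"))  # node-2
```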
To avoid skews and hotspots, many distributed data stores use a hash function to determine the partition for a given key.
Consistent Hashing
Consistent hashing works by placing both nodes and keys on a hash ring: a circular hash space in which each key is assigned to the first node encountered at or after the key's hash value, wrapping around at the end of the ring.
Minimizes the number of keys that need to be remapped when nodes are added or removed (My take: it’s also one of the popular questions of system-design interviews).
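A minimal hash-ring sketch in Python (the node names, the use of MD5 as the hash function, and the absence of virtual nodes are all simplifying assumptions):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Map any string onto the ring's integer hash space.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring (no virtual nodes), for illustration only."""

    def __init__(self, nodes):
        self._ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        positions = [pos for pos, _ in self._ring]
        # First node clockwise from the key's position; wrap around at the end.
        i = bisect.bisect(positions, ring_hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
# Adding "node-d" only remaps the keys that fall between node-d and its
# predecessor on the ring; every other key keeps its old node.
```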
Hashing the key and assigning it to partitions gives up the ability to do efficient range queries on the primary key.
As much as hashing the key helps with skew and hotspots, it can't avoid them entirely: if most queries target one specific key, all of those requests still hit the same partition.
Most data systems can't automatically compensate for such hotspots; that's why it's the application developer's responsibility, e.g., appending a random suffix to hot keys so that their writes are spread evenly across partitions (at the cost of reads having to fetch and merge all the sub-keys).
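A sketch of that idea, assuming a made-up key format and a split factor of 10:

```python
import random

SPLITS = 10  # how many sub-keys a known hot key is spread across (tuning knob)

def write_key(hot_key: str) -> str:
    # Spread writes for a hot key across several partitions by adding a suffix.
    return f"{hot_key}#{random.randrange(SPLITS)}"

def read_keys(hot_key: str) -> list[str]:
    # The cost: every read now has to fetch all sub-keys and merge the results.
    return [f"{hot_key}#{i}" for i in range(SPLITS)]

print(write_key("celebrity:123"))   # e.g. celebrity:123#7
print(read_keys("celebrity:123"))   # all 10 sub-keys must be queried
```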
Partitioning becomes more complex when queries use secondary indexes, since a secondary index lookup maps back to the primary records and may require an extra hop (a disk seek or a network call).
Secondary Indexes
There are two ways to handle secondary indexes in partitioning:
Document-based
Term-based
Document-Based
Document-based (local) indexing means each partition maintains its own secondary index, covering only the documents stored in that partition (as opposed to the global index used in term-based partitioning), so it can serve index updates and local lookups efficiently.
Reads by a secondary index, however, require sending the query to all partitions and aggregating the returned rows, also known as scatter/gather.
Even though the queries run in parallel across machines, this approach is prone to tail latency amplification (the overall response time is dominated by the slowest partition) and is expensive.
Regardless, many popular databases use this method for their secondary indexes.
Examples include MongoDB, Couchbase, Cassandra, ElasticSearch, CouchDB [1].
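A toy scatter/gather over local indexes (the partition contents and term format below are invented for illustration):

```python
# Each partition keeps a local secondary index covering only its own documents.
partitions = [
    {"color:red": [1, 5], "color:blue": [2]},      # partition 0
    {"color:red": [11], "color:green": [12, 13]},  # partition 1
    {"color:blue": [21, 22]},                      # partition 2
]

def query_secondary(term: str) -> list[int]:
    """Scatter the query to every partition, then gather and merge the results."""
    results = []
    for local_index in partitions:   # in reality: parallel requests to all nodes
        results.extend(local_index.get(term, []))
    return sorted(results)

print(query_secondary("color:red"))  # [1, 5, 11] -- every partition was asked
```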
Term-Based
Term-based partitioning keeps a global index over the secondary values. That global index is itself partitioned (by term) to avoid becoming a bottleneck, and it tells you where to look for each term or range of terms.
This way, reads are faster (only the relevant index partition needs to be queried), but writes pay a penalty for the bookkeeping, since a single document write may touch several index partitions.
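A toy term-partitioned global index (the partition count, term format, and routing via Python's built-in hash are all assumptions of the sketch):

```python
# The global index is itself partitioned by term, so a lookup for one term
# goes to exactly one index partition instead of to all of them.
NUM_INDEX_PARTITIONS = 4
index_partitions = [dict() for _ in range(NUM_INDEX_PARTITIONS)]

def index_partition_for(term: str) -> dict:
    return index_partitions[hash(term) % NUM_INDEX_PARTITIONS]

def index_document(doc_id: int, terms: list[str]) -> None:
    # Write-side penalty: one document update may touch several index
    # partitions (one per distinct term).
    for term in terms:
        index_partition_for(term).setdefault(term, []).append(doc_id)

def lookup(term: str) -> list[int]:
    # Read-side win: a single index partition answers the query.
    return index_partition_for(term).get(term, [])

index_document(1, ["color:red", "make:ford"])
index_document(2, ["color:red", "make:vw"])
print(lookup("color:red"))   # [1, 2]
```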
Rebalancing
The minimum requirement for data rebalancing with any approach is the following:
After rebalancing, the load should be distributed fairly
The database should still be available during rebalancing
No more data than necessary should be moved between nodes
Partition Strategies
We generally have two approaches available: static & dynamic.
Static Partition Size
Hash mod N is not a good partitioning strategy: if the number of nodes N changes at some point, most keys end up mapped to a different node, causing an expensive rebalance that hurts the performance of actual queries.
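A quick back-of-the-envelope check of how bad hash mod N is (MD5 as the hash and a 4-to-5-node resize are arbitrary choices for the sketch):

```python
import hashlib

def stable_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"user:{i}" for i in range(10_000)]

# Grow the cluster from 4 to 5 nodes and count how many keys change partition.
moved = sum(1 for k in keys if stable_hash(k) % 4 != stable_hash(k) % 5)
print(f"{moved / len(keys):.0%} of keys change partition")  # roughly 80%
```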
A better approach is to create a fixed number of partitions, many more than there are nodes; when a node is added, it steals whole partitions from existing nodes, while the old assignment keeps serving requests until the transfer completes.
A good partition count is chosen with the maximum number of nodes you ever expect to need in mind. Either way, individual partitions must not be too large (which makes rebalancing and recovery from node failures expensive) nor too small (too many tiny partitions incur management overhead).
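A sketch of the fixed-partition-count idea, with made-up partition and node counts; a real rebalancer would pick which partitions to steal far more carefully:

```python
NUM_PARTITIONS = 256   # fixed up front, chosen well above the expected node count

def partition_of(key: str) -> int:
    # The key -> partition mapping never changes for the lifetime of the database.
    return hash(key) % NUM_PARTITIONS

# Only the partition -> node assignment changes during rebalancing.
assignment = {p: f"node-{p % 4}" for p in range(NUM_PARTITIONS)}

def add_node(assignment: dict[int, str], new_node: str, total_nodes: int) -> int:
    """Let the new node steal whole partitions until the load is roughly even."""
    target = NUM_PARTITIONS // total_nodes
    moved = 0
    for p in sorted(assignment):
        if moved >= target:
            break
        assignment[p] = new_node   # only these partitions move; keys never re-hash
        moved += 1
    return moved

print(add_node(assignment, "node-4", total_nodes=5))  # 51 of 256 partitions move
```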
Dynamic Partition Size
Dynamic partitioning splits a partition when it grows beyond a configurable size and merges partitions that have shrunk below a threshold, so the number of partitions adapts to the total volume of data and partitions can move between nodes at runtime.
It works for both key-range-partitioned and hash-partitioned data.
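A toy illustration of the split half of dynamic partitioning, using a tiny key-count threshold in place of a real on-disk size threshold:

```python
MAX_KEYS_PER_PARTITION = 4   # stand-in for a real size threshold (e.g. gigabytes)

def split_if_needed(partitions: list[list[str]]) -> list[list[str]]:
    """Split any partition that has grown past the threshold into two halves."""
    result = []
    for keys in partitions:
        if len(keys) > MAX_KEYS_PER_PARTITION:
            mid = len(keys) // 2
            result.append(keys[:mid])   # each half can now live on a different node
            result.append(keys[mid:])
        else:
            result.append(keys)
    return result

parts = [["a", "b", "c", "d", "e", "f"], ["x", "y"]]
print(split_if_needed(parts))   # [['a', 'b', 'c'], ['d', 'e', 'f'], ['x', 'y']]
```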
Automatic vs. Manual Partitioning
Between automatic and manual partitioning (the rebalancing decision, that is), the former has less operational overhead but can go badly wrong if something unexpected happens; hence, it's a good idea to keep an administrator in the loop. It's more overhead, but safer (and we can never have enough safety, let me tell you, brother).
Request Routing Methods
Generally speaking, there are three approaches available.
The client contacts any node, which either answers the query itself or acts as a middleman, forwarding it to the node responsible for that partition.
A routing tier (proxy) handles all routing, sending each request to the right node.
The client is partition-aware and knows which node to talk to for each partition (a minimal sketch of this lookup follows below).
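A minimal sketch of the partition-aware lookup, assuming an invented partition map; in practice this mapping is often kept in a coordination service such as ZooKeeper and cached by the routing tier or the clients:

```python
NUM_PARTITIONS = 8

# Partition -> node map; here hard-coded, in reality fetched from cluster metadata.
partition_map = {p: f"node-{p % 3}" for p in range(NUM_PARTITIONS)}

def route(key: str) -> str:
    """Return the node that should receive the request for this key."""
    partition = hash(key) % NUM_PARTITIONS
    return partition_map[partition]

print(route("user:42"))   # the client (or proxy) sends the request straight here
```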
Conclusion
This chapter showed me how difficult it is to maintain and operate a distributed data storage system. It covered many of the challenges one will face if partitioning ever becomes necessary.
References
[1] https://chat.openai.com/share/cc819ab9-4171-4545-89fa-49888456de82
TL;DR
This chapter of the book explores the importance of data partitioning in distributed systems for scalability.
It covers various topics, such as approaches to partitioning, rebalancing, request routing, and dealing with skew in data distribution.
It also discusses consistent hashing, handling secondary indexes, rebalancing strategies, and the choice between automatic and manual partitioning.
It provides insights into the key considerations and methods for optimizing data distribution in distributed data stores.



