Designing Data-Intensive Applications Summary: Chapter 10 - Batch Processing
Explore the intricacies of processing massive datasets at scale with a shared-nothing architecture spread across distributed nodes.

Introduction
This chapter opens the third and final section of the book, where the focus shifts towards practical discussions of how real-world applications process data at scale, the challenges they face, and the currently known solutions for dealing with them.
It borrows from all the knowledge built up so far and applies it to applications we encounter in our day-to-day lives.
TL;DR 📚
Datastore Diversity: Many data apps use multiple datastores based on access patterns.
Two Storage Types: Source of truth for decisions, derived data for performant reads.
UNIX Philosophy: Emphasizes automation, rapid prototyping, iteration, and experimentation.
MapReduce Framework: Processes large datasets and parallelizes work across machines.
Human Fault Tolerance: MapReduce improves recovery from buggy code by removing side effects.
Datastore Diversity 🗄️
Most data applications don’t rely on only a single datastore; they combine multiple types depending on the access patterns and other characteristics of the data. The reason is simple: no individual tool can satisfy all the requirements of today’s complex data systems.
Generally speaking, there are two types of data storage systems:
Source of truth: holds the facts the application needs in order to make decisions.
Derived data: another [materialized] view of the source of truth that is redundant, derivable, and almost always built for performant reads (see the sketch below).
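To make the distinction concrete, here is a minimal Python sketch (the data and names are made up, not from the book): a list of order records acts as the source of truth, while a per-customer total is a derived, redundant view built purely for fast reads and always rebuildable from the facts.

```python
# Hypothetical example: orders are the source of truth; the per-customer
# totals are derived data that can always be rebuilt from the orders.
from collections import defaultdict

orders = [  # source of truth: the facts the application records
    {"order_id": 1, "customer": "alice", "amount": 30},
    {"order_id": 2, "customer": "bob", "amount": 15},
    {"order_id": 3, "customer": "alice", "amount": 20},
]

def build_totals_view(orders):
    """Derived data: a redundant, read-optimized view of the source of truth."""
    totals = defaultdict(int)
    for order in orders:
        totals[order["customer"]] += order["amount"]
    return dict(totals)

totals_view = build_totals_view(orders)   # {'alice': 50, 'bob': 15}
```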

UNIX Philosophy 🐧
In summary, the UNIX philosophy emphasizes:
Automation
Rapid prototyping
Iteration
Being friendly to experimentation
Breaking down large projects into manageable chunks
Not many pieces of software interoperate and compose as well as UNIX tools do, even today.
It’s the exception, not the norm, for programs to work together as smoothly as UNIX tools do. 💪
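As a rough illustration of that composability (my own sketch, not from the book): a tiny Python filter that reads stdin and writes stdout speaks the same uniform interface as the standard tools, so it can slot into a pipeline such as `cat access.log | python3 extract_url.py | sort | uniq -c | sort -rn | head -n 5`. The script name and log format are hypothetical.

```python
#!/usr/bin/env python3
# extract_url.py (hypothetical): a small filter in the UNIX style.
# It reads lines on stdin, writes one field per line on stdout, and can
# therefore be composed with sort, uniq, head, or any other filter.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 6:          # assume the 7th field is the requested URL
        print(fields[6])
```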
MapReduce 🗺️
MapReduce is a programming framework for writing code that processes large datasets stored across a distributed filesystem such as HDFS.
MapReduce can parallelize the work across many machines.
Like UNIX pipes, multiple MapReduce jobs can be chained so that the output of one becomes the input of the next.
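To make the programming model concrete, here is a toy, single-process word-count sketch (an in-memory imitation, not the real Hadoop API): the developer supplies a mapper that emits key-value pairs and a reducer that processes all values collected for a key, while the framework handles the grouping ("shuffle") in between.

```python
# A toy, in-memory imitation of the MapReduce programming model.
# In a real cluster the framework partitions the input, runs mappers and
# reducers on many machines, and shuffles records between them.
from collections import defaultdict

def mapper(line):                     # user-supplied: emit (key, value) pairs
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):             # user-supplied: combine all values for a key
    yield key, sum(values)

def run_mapreduce(records, mapper, reducer):
    groups = defaultdict(list)        # "shuffle": group values by key
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = {}
    for key, values in groups.items():
        for out_key, out_value in reducer(key, values):
            output[out_key] = out_value
    return output

lines = ["the quick brown fox", "the lazy dog"]
print(run_mapreduce(lines, mapper, reducer))   # {'the': 2, 'quick': 1, ...}
```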
By writing job output to an immutable distributed filesystem and keeping the jobs free of side effects, we gain two advantages:
Increased performance
Increased maintainability
Fault Tolerance 🛡️
What is human fault tolerance? The ability to recover from buggy code.
MapReduce achieves this by removing side effects from the jobs: if a job’s code turns out to be buggy, its output can simply be discarded, the code fixed, and the job rerun against the unchanged input.
MapReduce, a New Era 🌐
MapReduce, as revolutionary as it may have sounded at the time of its inception, wasn’t a new idea; massively parallel processing databases had already done a similar job before.
The main difference is that those older systems focused on the parallel execution of analytic SQL queries.
MapReduce opened the door for the world to realize that, in practice, making data available quickly (think of lazy generators in programming), even in a quirky, difficult-to-use format, is often more valuable than trying to decide on the ideal data model upfront, as databases usually try to enforce.
By the same token, data lakes or enterprise data hubs collect data in raw form and worry about schema design later, which speeds up data collection.
With this approach, interpreting the data’s schema becomes the consumer’s problem, and the consumers may be different teams with different priorities.
This idea is dubbed the sushi principle: “Raw data is better”.
Writing data in raw form to the data system opens the door for many other applications to use it for their own purposes.
For example, SQL queries can later run over the raw records, imposing structure at read time to produce an analytics view of the data.
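A rough sketch of that schema-on-read idea (file name, fields, and records are made up): raw JSON lines are stored exactly as they arrive, and each consumer imposes only the structure it cares about when it reads.

```python
# Hypothetical schema-on-read: the raw file is written without any agreed
# schema, and each consumer extracts only the fields it needs when reading.
import json

raw_events = [
    '{"user": "alice", "event": "click", "ts": 1700000000, "page": "/home"}',
    '{"user": "bob", "event": "purchase", "ts": 1700000100, "amount": 42}',
]

with open("events.raw.jsonl", "w") as f:     # data lake style: store it raw, as-is
    f.write("\n".join(raw_events))

def purchases_view(path):
    """One consumer's view: only purchases, only the fields it cares about."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("event") == "purchase":
                yield {"user": record["user"], "amount": record.get("amount", 0)}

print(list(purchases_view("events.raw.jsonl")))   # [{'user': 'bob', 'amount': 42}]
```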
Fault Tolerance in MapReduce 🧑🚒
Jobs in MapReduce are retryable because the output of a job that didn’t run to successful completion can simply be discarded. Scheduling another identical instance won’t manipulate the input data: the job treats its input as immutable and only publishes its output once everything is ready.
This means we can run the same job multiple times and get the same result. That property is idempotency, and it is truly desirable for data systems.
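A minimal sketch of that property (paths and logic are illustrative, not from the book): the job never modifies its input and publishes its output only once it is complete, so a failed or repeated run can simply be discarded and retried.

```python
# Illustrative idempotent batch job: the input is never modified, and the
# output is published atomically, so rerunning the job yields the same result.
import os

def run_job(input_path, output_path):
    with open(input_path) as f:                  # input is treated as immutable
        total = sum(int(line) for line in f if line.strip())
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w") as f:               # write to a temporary file first
        f.write(str(total))
    os.replace(tmp_path, output_path)            # publish only when fully complete

with open("numbers.txt", "w") as f:
    f.write("1\n2\n3\n")

run_job("numbers.txt", "sum.out")
run_job("numbers.txt", "sum.out")                # retrying is harmless: same output
print(open("sum.out").read())                    # 6
```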
Beyond MapReduce 🚀
What is materialization in the context of batch processing? The process of writing out the intermediate state between multiple related jobs to files.
MapReduce is very powerful, yet connecting separate jobs is not easy, so some alternatives build on the ideas of MapReduce and treat the entire workflow of jobs as a single one.
That way, we don’t have to wait for one job to finish (and fully materialize its output) before the next one starts; each stage can begin processing data as soon as it becomes available.
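One rough way to picture the difference (plain Python, not any particular engine): instead of the first stage writing out its complete intermediate result before the second stage starts, the stages can be chained lazily so each record flows through the whole pipeline as soon as it is produced.

```python
# Contrast between materializing intermediate state and streaming it through.
def parse(lines):                     # stage 1: normalize each record
    for line in lines:
        yield line.strip().lower()

def keep_errors(records):             # stage 2: filter for error records
    for record in records:
        if "error" in record:
            yield record

lines = ["INFO boot ok", "ERROR disk full", "error net down"]

# Materialized style: stage 1 finishes and its full output is stored
# before stage 2 even starts (what chained MapReduce jobs do with files).
intermediate = list(parse(lines))
materialized_result = list(keep_errors(intermediate))

# Dataflow style: the stages are composed lazily, so each record flows
# through both stages without the intermediate list ever being built.
streamed_result = list(keep_errors(parse(lines)))

assert materialized_result == streamed_result == ["error disk full", "error net down"]
```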
Conclusion 🎉
This chapter focused on batch processing and its flagship, well-known tool: MapReduce. The idea of MapReduce did not remain just inside Google; in today’s world we can see many systems implementing the same idea. One example is MongoDB, which offers its own mapReduce operation.
Idempotency is an important aspect of batch processing: the input is treated as immutable, so rerunning a job won’t manipulate the data, and retrying a failed job is harmless.
The next chapter focuses on stream processing, a superset of batch processing in which the data is unbounded, i.e., the input is infinite, continuous, and never-ending (e.g., data coming from sensors).


