Brian Drennan's Blog

Exploring curiosities in software.

Designing Data-Intensive Applications Review

• 7 min read
book review · DDIA

I recently decided to pick up a copy of the book Designing Data-Intensive Applications by Martin Kleppmann. I work on a Manufacturing Execution System (MES) that is responsible for collecting data across several disparate sources, replicating it to long-term durable storage as well as short-term durable message queues, and finally, processing some subset of that data as close to real time as we can get. I’ve seen this book referenced a number of times when discussing system design (and specifically, preparing for system design interviews), and decided to pick it up to continue my own learning. This post outlines some of my background and experience, a sample problem of the sort my team solves in our system, and a brief overview of the first chapter of the book.

Background

I’ve done a fair amount of research into things like parallel programming over the years. I wouldn’t consider myself an expert by any stretch, but I think I can generally hold my own fairly well. One of my favorite books on the topic was Concurrent Programming on Windows (Duffy). It really solidified a number of important concepts for me: thread safety; data dependence and instruction ordering; the history of critical sections, semaphores, and other primitives; trivia about Windows Fibers; parallelization of algorithms and the importance of functional programming paradigms; lock granularity; and so many other details. For some of these concepts, the book was my first exposure; for others, it reinforced and solidified my understanding of the concepts and how to employ them successfully as a developer.

A short time after reading this book, I started to become interested in microservices (like everyone else at the time). It was in learning about these that I got my first real introduction to distributed systems and all the complexity that comes along with them. I learned about things like machine clustering, distributed storage and replication, the CAP theorem, and leader election. Around this time, I picked up Site Reliability Engineering: How Google Runs Production Systems, and had my first exposure to the concept of chaos testing.

These areas of study, along with learning about database optimization and internals, have led me to where I am now. I currently work on a distributed system that runs on several tens of servers, coordinating tasks and data. Many of the problems I have to solve involve designing idempotent business transactions, overall API design, database tuning with our database management team, and other factors that affect performance. I got involved in understanding the architecture of the production system, learning how to monitor it and how its various parts work together. That eventually led me to develop a pattern for executing distributed queries across multiple datacenters in some of my applications, to deal with data locality and storage constraints. I recently did a lot of prototyping and benchmarking of a new message-brokering platform to replace our current solution. I hope that Designing Data-Intensive Applications will help me better achieve the goal of building well-designed, cost-effective, robust applications. Along the way, to help with my own retention, I plan to write about some of the learnings here.

Motivation

I started reading Designing Data-Intensive Applications a few days ago, and have made what I consider to be decent progress (two chapters). I was perfectly happy to just pick it up, read as much as I could in a session, then put it down to come back to later. That was before today. At work, I ran into the most peculiar of issues after seeing an alert from one of my production systems. The alert indicated that a particular business transaction had been aborted due to a constraint violation. Reviewing the stack trace from the logging system, I realized it was code that I’d written, and I knew that code was idempotent and incapable of producing the error that I saw (I’ve got the tests to prove it!). Still, here was a pesky error message that I couldn’t explain, and it mildly concerned me.

One of the fortunate things about building a product that has a serial number is that we can pump that serial number into all of our application logs as part of our structured logging. This enabled me to query everything about the system related to the production of that component for a block of time, ordered by timestamp. One of the things I noticed was that the system had somehow received a copy of the same message twice, and was attempting to process both copies concurrently.
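To make that serial-number logging concrete, here’s a minimal sketch of the idea using Python’s standard logging module. The field names, logger name, and serial number here are hypothetical stand-ins, not our actual schema:

```python
import json
import logging
import sys

# Emit each record as JSON so fields like serial_number are queryable.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "serial_number": getattr(record, "serial_number", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("mes")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Bind the component's serial number once; every line then carries it.
log = logging.LoggerAdapter(logger, {"serial_number": "SN-12345"})
log.info("received production message")
log.info("processing complete")
```

Once every log line carries the serial number, “show me everything that happened to this component, ordered by timestamp” becomes a single query against the log store.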

Upon further analysis, I was able to determine that the two messages weren’t quite identical. This data comes from an external producer, which simply publishes data using a weak form of an acknowledgement protocol: if an acknowledgement is missed, the data is re-transmitted. Sounds sort of like TCP, right? Getting back to the duplicate data I observed, I realized that the system wasn’t quite smart enough to deal with this case (clearly, since it failed). Thankfully, the system also has some top-level retry handlers that use a randomized wait, due to some other concurrency issues it has to deal with, and as a result it was able to deal with the duplicate data successfully. That transaction performs an “upsert,” inserting data that does not exist and updating data that already does. The transaction also models a pure function, so the biggest consequence of re-processing the same data is that the system generates some unnecessary write traffic in the database.
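As a sketch of that upsert idea, here’s what it can look like using SQLite’s ON CONFLICT clause (3.24+) as a stand-in for whatever syntax your database offers. The table and columns are hypothetical, not our real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurements (
        serial_number TEXT NOT NULL,
        step          TEXT NOT NULL,
        value         REAL NOT NULL,
        PRIMARY KEY (serial_number, step)
    )
""")

def record_measurement(serial_number, step, value):
    # Insert if the row is new, overwrite if it already exists.
    # Replaying the same message converges to the same state, so a
    # retransmission only costs an extra write.
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO measurements (serial_number, step, value)
            VALUES (?, ?, ?)
            ON CONFLICT (serial_number, step)
            DO UPDATE SET value = excluded.value
            """,
            (serial_number, step, value),
        )

# Processing the "same" message twice lands on identical state.
record_measurement("SN-12345", "solder-inspect", 0.98)
record_measurement("SN-12345", "solder-inspect", 0.98)
```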

Examining the problem further, I arrived at the conclusion that the system needs a proper distributed locking service. The code that raced when the duplicate data arrived could have avoided that condition entirely if it could cheaply acquire locks from an intelligent lock service. Those locks would be very short-lived and fine-grained, allowing synchronization around a couple of “key” fields. We do have a system that’s used for distributed task coordination, but it’s really meant for coarse-grained scenarios, where lock acquisition determines which node “owns” a task among a group of nodes, serving as a cheap substitute for leader election.
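To illustrate the shape of what I have in mind (a sketch, not a production design), here’s a short-lived, fine-grained lock built on Redis’s SET NX EX primitive, keyed on the fields that raced. A real deployment would have to contend with fencing, clock skew, and the other caveats the book itself gets into later; the key fields below are hypothetical:

```python
import uuid

import redis  # assumes a reachable Redis instance

r = redis.Redis()

def try_lock(key_fields, ttl_seconds=5):
    """Try to acquire a short-lived lock keyed on business fields.

    Returns a token on success (needed to release safely), or None
    if another worker currently holds the lock.
    """
    key = "lock:" + ":".join(key_fields)
    token = str(uuid.uuid4())
    # Set only if absent (NX), with an expiry (EX) so a crashed
    # worker can't hold the lock forever.
    if r.set(key, token, nx=True, ex=ttl_seconds):
        return token
    return None

def unlock(key_fields, token):
    # Delete only if we still own the lock; Lua keeps the compare
    # and the delete atomic on the server.
    script = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
    """
    r.eval(script, 1, "lock:" + ":".join(key_fields), token)

token = try_lock(["SN-12345", "solder-inspect"])
if token:
    try:
        ...  # process the message; a duplicate arriving now is shut out
    finally:
        unlock(["SN-12345", "solder-inspect"], token)
```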

Why It Matters

I told that story about racing on duplicate data, and about concluding that my system needs a distributed locking service, because it’s a specific example of why I think Designing Data-Intensive Applications is probably a good choice for me. The very first chapter discusses system reliability, scalability, and maintainability. A distillation of these is:

  • Reliability is the ability of the system to handle faults and continue to operate at a particular performance level
    • Faults can be hardware, software, or human errors
    • Hard-disk failure, network outages, etc.
  • Scalability is the ability to handle an increase in volume or system load (I likened this to “elasticity”), and answers the question, “can we handle 1.5x, 2x, or 10x our current volume with our current capacity?”
  • Maintainability is the ability to operate, support, and extend the system over time

It also establishes the idea that many systems are not CPU-bound, but data-bound: the primary challenges in the system are related to things like data volume and production frequency. It specifically discusses Twitter, and cites a talk given by Raffi Krikorian (I don’t have the original reference, but this seems pretty close) about Twitter’s timeline design, comparing it to fan-out in digital circuit design. Scanning the table of contents, I can see it will also cover things like storage system design, replication, consensus, and many other topics. While I’ve studied most of these things before, Designing Data-Intensive Applications seems like a great resource to refresh and reinforce the concepts, as well as to learn more about system design through its many case studies.
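The timeline case study is essentially fan-out on write: rather than assembling a user’s home timeline at read time, each new tweet is pushed into a precomputed timeline for every follower. Here’s a toy in-memory version of the idea (the chapter notes the real system is a hybrid, falling back to read-time merging for accounts with very large follower counts):

```python
from collections import defaultdict, deque

# Toy follower graph: author -> set of followers.
followers = {"alice": {"bob", "carol"}, "bob": {"carol"}}

# Each user's home timeline is precomputed: fan-out on write.
timelines = defaultdict(lambda: deque(maxlen=100))

def post_tweet(author, text):
    # One write by the author becomes N timeline writes (the fan-out),
    # making the home timeline read a cheap single lookup.
    for follower in followers.get(author, ()):
        timelines[follower].appendleft((author, text))

post_tweet("alice", "hello, world")
print(list(timelines["carol"]))  # [('alice', 'hello, world')]
```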

Designing Data-Intensive Applications feels like a good book choice for me. I noticed it came up frequently as recommended reading from folks I admire. I hope it will help me with overall design: better understanding not only the trade-offs that exist, but which ones I’m making, and thinking about software differently when analyzing and designing solutions.