Notes from Chapter 1 of Designing Data-Intensive Applications
If you’ve spent any time around backend engineers, you’ve probably noticed they love to argue. Postgres or MongoDB? Kafka or RabbitMQ? Microservices or “the modulith”? Most of these debates feel like they’re about technology, but they’re almost never really about technology. They’re about tradeoffs — and the tradeoffs only make sense when you know what you’re optimizing for.
Martin Kleppmann opens Designing Data-Intensive Applications by giving us the vocabulary to have those arguments properly. Chapter 1 is called “Reliable, Scalable, and Maintainable Applications,” and those three adjectives are the entire point of the book. Get them right, and the rest of the 500 pages is essentially a tour of how different systems make different bets in service of those three goals.
Here’s what the chapter actually says, and why it’s worth slowing down on before racing into the chapters about replication and consensus.
What even is a “data-intensive” application?
Kleppmann’s framing in the first few pages is small but important: most of the systems we build today are not bottlenecked by raw CPU. They’re bottlenecked by the amount of data, the complexity of it, or the speed at which it changes. A web app that serves a million users isn’t doing hard math. It’s juggling state — reading it, writing it, caching it, indexing it, replicating it, keeping it consistent enough to be useful and inconsistent enough to be fast.
These applications are built out of remarkably standard parts: databases, caches, search indexes, message queues, stream processors, batch processors. The interesting engineering question isn’t usually “which database?” It’s “how do these pieces fit together for this workload?” Two apps with identical tech stacks can have wildly different architectures because they’ve answered that question differently.
That sets up the rest of the chapter. If your job is gluing data systems together, what are you actually trying to achieve?
Reliability: keep working when things go wrong
The first goal is reliability — and Kleppmann gives a definition that sounds obvious but is genuinely useful: a reliable system continues to work correctly even when things go wrong.
The key distinction here is between a fault (one component misbehaving) and a failure (the whole system stopping). The job of a reliable system isn’t to prevent faults — that’s impossible — it’s to prevent faults from cascading into failures. That’s what “fault-tolerant” means.
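The fault/failure distinction is easy to make concrete with a toy sketch (all the names and data structures here are invented for illustration, not from the book): a read that tolerates individual replica faults and only becomes a failure when every replica is down.

```python
class ReplicaDown(Exception):
    """Raised when a simulated replica node is unavailable."""

def read_from(replica):
    # Simulated replica: a (name, is_broken) pair.
    node, broken = replica
    if broken:
        raise ReplicaDown(node)
    return f"value-from-{node}"

def fault_tolerant_read(replicas):
    # A fault (one replica down) is tolerated by trying the next one;
    # only when *every* replica faults does the read become a failure.
    last_err = None
    for replica in replicas:
        try:
            return read_from(replica)
        except ReplicaDown as err:
            last_err = err  # note the fault and route around it
    raise RuntimeError("failure: no healthy replicas") from last_err
```

The redundancy here is the whole trick: the fault still happens, but the system's observable behavior stays correct.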
Faults come in three flavors:
- Hardware faults. Disks die, power cuts out, networks flake. We’ve been dealing with these for decades, mostly through redundancy: RAID arrays, dual power supplies, multiple availability zones. This is the easy category, in the sense that the failure modes are well understood.
- Software errors. Bugs, runaway processes, cascading failures where one slow service takes down everything that depends on it. These are nastier because they can hit every replica simultaneously — your fancy redundancy won’t save you if all three nodes have the same bug.
- Human errors. And here’s the punchline: humans cause more outages than hardware. The defenses are good abstractions, sandboxed environments for testing, telemetry that catches problems early, and — crucially — making it easy to roll back when someone inevitably ships something broken at 4pm on a Friday.
It’s tempting to skip reliability work on “non-critical” applications, but Kleppmann pushes back on that: the cost of losing user trust usually exceeds the cost of building things properly the first time. A photo app isn’t life-or-death, but if it loses your wedding photos once, you’re never opening it again.
Scalability: cope with growth
Scalability is the one everyone thinks they understand and almost no one defines properly. Kleppmann’s framing is that “scalable” isn’t a property a system has or doesn’t have — it’s a question, and the question only makes sense if you specify two things: what you mean by load, and what you mean by performance.
Load is whatever parameter actually pressures your system. For a web server it might be requests per second. For a cache it might be the hit rate. For a database it might be the read/write ratio. The chapter’s famous Twitter example shows why this matters: serving home timelines is a hard problem, but the right solution depends entirely on whether you optimize the read path (fan out on write: materialize each user’s home timeline as tweets arrive) or the write path (keep writes cheap and merge at read time, querying the posts of everyone the user follows when they open the app). Twitter actually switched approaches as their workload changed. Same problem, different load characteristics, different architecture.
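The two strategies can be sketched in a few lines (in-memory dicts standing in for real storage; all names here are invented for illustration):

```python
from collections import defaultdict

posts_by_user = defaultdict(list)  # user -> list of (timestamp, text)
follows = defaultdict(set)         # user -> users they follow
followers = defaultdict(set)       # inverse of `follows`
timelines = defaultdict(list)      # user -> materialized home timeline

def post(user, ts, text):
    posts_by_user[user].append((ts, text))
    # Fan-out on write: push the new post into every follower's
    # precomputed timeline, paying the cost once, at write time.
    for follower in followers[user]:
        timelines[follower].append((ts, text))
        timelines[follower].sort(reverse=True)  # newest first

def timeline_write_optimized(user):
    return timelines[user]  # a single cheap lookup

def timeline_read_optimized(user):
    # Merge on read: do the work when the user opens the app.
    merged = []
    for followee in follows[user]:
        merged.extend(posts_by_user[followee])
    return sorted(merged, reverse=True)
```

Both return the same timeline; the difference is purely where the work lands, which is exactly why the load characteristics (tweets posted vs timelines viewed) decide the architecture.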
Performance is the other side. And here Kleppmann lays down what I’d argue is the single most important rule in backend engineering: don’t use averages, use percentiles. A system with a 100ms average response time can still be miserable to use if the slowest 1% of requests take 10 seconds. Tail latencies — p95, p99, p999 — are what users actually feel, and they tend to disproportionately hit your most engaged customers, the ones who make the most requests and therefore have the most chances to roll the bad-luck dice.
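A quick sketch makes the point, using made-up numbers (95 requests at 100ms and 5 stragglers at 10 seconds) and a simple nearest-rank percentile:

```python
import math

latencies_ms = [100] * 95 + [10_000] * 5

def percentile(samples, p):
    # Nearest-rank percentile: the smallest sample value that is
    # greater than or equal to p% of all samples.
    ranked = sorted(samples)
    k = max(0, math.ceil(p * len(ranked) / 100) - 1)
    return ranked[k]

mean = sum(latencies_ms) / len(latencies_ms)
# mean is 595.0 ms, which looks acceptable on a dashboard;
# p99 is 10,000 ms, which is what your unluckiest users actually feel.
print(mean, percentile(latencies_ms, 50), percentile(latencies_ms, 99))
```

The median still says 100ms; the mean says everything is fine; only the high percentiles tell you that one in twenty requests is unusable.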
Once you know your load and your performance target, scaling becomes a design problem. You can scale up (bigger machine) or out (more machines), and the right answer depends on the workload. There is no universal scalable architecture. Anyone selling you one is selling you something.
Maintainability: make it livable
The last goal is the one engineers love least and pay for most. Most of a system’s lifetime cost is not in writing it; it’s in keeping it running, evolving it, and onboarding new people to it. Maintainability is the property that lets future-you (or the person who replaces you) keep the lights on without losing their mind.
Kleppmann breaks it into three sub-principles:
- Operability. Make it easy for the operations team to keep the system healthy. Good monitoring. Predictable behavior under load. Useful logs. Documentation that exists. Default behaviors that make sense. The classic “it works on my machine” failure is an operability failure.
- Simplicity. Manage complexity. Every system accumulates accidental complexity over time, and the main weapon against it is good abstractions — taking something messy and giving it a clean interface. The opposite, which Kleppmann calls a “big ball of mud,” is what happens when you skip this and let everything tangle into everything else.
- Evolvability. Make it easy to change. Requirements always change. The org changes, the product changes, the regulations change, the scale changes. Systems that can’t evolve get rewritten, and rewrites are expensive and dangerous.
If you’ve ever worked on a codebase where every change feels like defusing a bomb, you’ve experienced the absence of all three at once.
The takeaway
Reliability, scalability, maintainability. Three words, and the entire book is essentially “how do specific tools and techniques affect these three properties for specific workloads?”
What I find genuinely useful about this chapter is that it gives you a way to evaluate technical decisions without falling into religious wars. When someone says “we should use Kafka,” the question isn’t whether Kafka is good. It’s: which of these three properties does Kafka improve, by how much, for our workload, and at what cost to the others? Sometimes the answer is “a lot, cheaply, do it now.” Sometimes the answer is “not really, and it’ll add a fourth on-call rotation.”
There are no silver bullets. There are only tradeoffs against a specific set of requirements — and the first job of any backend engineer is being able to say what those requirements are.
That’s the real lesson of Chapter 1. The rest of the book is just receipts.