database design

Your Data Has Opinions About How It Wants to Be Stored

Notes from Chapter 2 of Designing Data-Intensive Applications

There’s a particular kind of argument that breaks out on engineering teams roughly once a quarter: SQL or NoSQL? It’s usually framed as a question about technology, or performance, or scalability, but it almost never actually is. It’s a question about shape. Specifically: what shape is your data, and what shape do you want to query it in?

Chapter 2 of Designing Data-Intensive Applications is called “Data Models and Query Languages,” and Martin Kleppmann uses it to make a case that’s surprisingly hard to internalize: the data model you pick has more impact on how your software gets written — and how it ages — than almost any other decision. Pick the wrong shape and you’ll spend years gluing it back together with application code. Pick the right one and a lot of problems just stop existing.

Here’s the tour.

The relational model won, then NoSQL happened, then everyone calmed down

Kleppmann opens with a quick history lesson, and it’s actually load-bearing. The relational model (tables, rows, columns, joins) was proposed in 1970 and had become dominant by the mid-1980s, beating out the hierarchical and network models that came before. It won because it was simpler and more flexible, and because SQL gave you a clean separation between what you wanted and how the database should get it.

Then in the late 2000s, “NoSQL” happened. Suddenly everyone was on MongoDB and the relational model was dead. Except it wasn’t — and the chapter is genuinely useful at explaining why the rebellion happened and why it cooled off.

The NoSQL pitch was real: better scalability for certain workloads, more flexible schemas, query patterns that mapped naturally to specific applications, and a way to dodge the object-relational impedance mismatch — the awkward translation layer between objects in your code and rows in your tables that ORMs spend their entire existence patching over.

But the relational model didn’t die. It absorbed. Today’s databases are increasingly hybrid: PostgreSQL has JSON columns, MongoDB has joins. The interesting question stopped being “which side are you on?” and became “which shape fits your data?”

Documents are great until your data has friends

The core argument for document databases is locality. If your data is shaped like a tree — a user with their profile, their posts, their preferences, all naturally nested inside one logical thing — then storing it as a single document means one read fetches everything. No joins. No N+1 queries. The structure of the data on disk matches the structure your code wants.

For one-to-many relationships, this is genuinely lovely. A blog post and its tags, a resume and its work history, an order and its line items — these are tree-shaped, and trees fit nicely in documents.

The problem starts when relationships go many-to-many. Suppose two users went to the same university. In a document model, you either duplicate the university name in both user documents (and now updating it is a nightmare) or you store an ID and look the university up separately (which is a join, just one your application has to do by hand). Either way, you’ve reinvented something the relational model gave you for free.
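To make the two options concrete, here is a minimal sketch in plain Python dictionaries. The collections, field names, and the university itself are made up for illustration, and no real document database API is involved; the point is only where the work ends up.

```python
# Hypothetical in-memory "collections"; a real document store would hold these.
universities = {
    "u1": {"name": "Example State University"},
}

# Option 1: duplicate the university name inside every user document.
users_denormalized = [
    {"name": "Ada", "university_name": "Example State University"},
    {"name": "Lin", "university_name": "Example State University"},
]

# Option 2: store an ID and resolve it in application code.
users_normalized = [
    {"name": "Ada", "university_id": "u1"},
    {"name": "Lin", "university_id": "u1"},
]


def rename_denormalized(users, old_name, new_name):
    """Renaming means touching every document that carries a copy."""
    touched = 0
    for user in users:
        if user["university_name"] == old_name:
            user["university_name"] = new_name
            touched += 1
    return touched


def users_with_university(users, universities):
    """A hand-written join: one extra lookup per user document."""
    return [
        {"name": u["name"], "university": universities[u["university_id"]]["name"]}
        for u in users
    ]
```

With option 1 a rename has to find and rewrite every copy. With option 2 the rename is a single update, but every read that needs the name pays for the lookup that a relational database would have done as a join.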

Kleppmann’s point isn’t that one model is better. It’s that the more interconnected your data is, the worse documents fit and the better relational fits. If your domain is mostly self-contained units, documents win. If everything points to everything else, you want joins.

Schema-on-read is not the same as no schema

One of the most useful clarifications in the chapter is about schemas. Document databases are often called “schemaless,” and Kleppmann pushes back on this hard. There’s always a schema — the question is just where it lives.

Relational databases use schema-on-write: the database enforces the structure when you insert data. Document databases typically use schema-on-read: the structure is implicit, enforced (or assumed) by whatever code reads the data later.

Neither is automatically better. Schema-on-read is genuinely useful when your data is heterogeneous, when the structure comes from external sources you don’t control, or when you’re moving fast and the schema is changing constantly. Schema-on-write is genuinely useful when you want guarantees, when many different applications read the same data, or when you’ve been burned one too many times by a field that was supposed to always be a string and turned out to sometimes be null, sometimes a number, and sometimes the literal string "undefined".

The flexibility of schema-on-read isn’t free. It’s a debt you pay later, in defensive code and migration scripts and 3am incidents.
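Here is roughly what that defensive code tends to look like. This is a sketch under assumptions: the document shapes and the `read_display_name` helper are hypothetical, standing in for a collection where older records stored a single `name` field and newer ones store `first_name`.

```python
def read_display_name(doc):
    """Interpret a user document at read time, tolerating older shapes."""
    first = doc.get("first_name")
    if isinstance(first, str) and first:
        return first
    # Older documents (hypothetically) only stored a full "name" field.
    full = doc.get("name")
    if isinstance(full, str) and full.strip():
        return full.split()[0]
    # Anything else (None, numbers, missing fields) falls back to a default.
    return "unknown"


documents = [
    {"first_name": "Ada", "last_name": "Lovelace"},
    {"name": "Grace Hopper"},
    {"name": None},
    {"first_name": 42},
]
names = [read_display_name(d) for d in documents]
```

The schema is still there. It has just moved out of the database and into every function that reads the data.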

Declarative beats imperative, almost always

The middle of the chapter is about query languages, and the central claim is one I think more programmers should sit with: declarative is better than imperative, and SQL is one of the great triumphs of declarative thinking.

In an imperative query, you tell the database how to find your data: loop through this, filter that, sort the other thing. In a declarative query, you describe what you want and let the database figure out how. SQL is declarative; a hand-written loop over records is imperative; MapReduce, as Kleppmann describes it, sits somewhere in between, with the query logic written as snippets of code that the framework calls repeatedly. CSS, perhaps surprisingly, is declarative: you describe what styled output you want, and the browser figures out how to lay it out.

The advantage of declarative isn’t just elegance. It’s that the database is free to optimize. It can add an index, parallelize the query, change the join order, switch algorithms based on table sizes — all without you rewriting anything. Imperative code locks in the how, which means it can’t get faster while you sleep.
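A small side-by-side may help. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `animals` table, its columns, and the sample rows are made up for illustration.

```python
import sqlite3

animals = [
    ("Great white shark", "Sharks"),
    ("Tiger shark", "Sharks"),
    ("Bottlenose dolphin", "Dolphins"),
]


def sharks_imperative(rows):
    """Imperative: spell out the loop, the test, and the result order."""
    result = []
    for name, family in rows:
        if family == "Sharks":
            result.append(name)
    return result


def sharks_declarative(rows):
    """Declarative: state the condition and let SQLite decide how to run it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
    conn.executemany("INSERT INTO animals VALUES (?, ?)", rows)
    found = conn.execute(
        "SELECT name FROM animals WHERE family = ? ORDER BY name", ("Sharks",)
    ).fetchall()
    conn.close()
    return [name for (name,) in found]
```

Both functions return the same names for this data. The difference is that an index on `family` could be added later without touching the `SELECT`, whereas the loop would have to be rewritten to take advantage of it.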

This is one reason SQL that is merely “good enough,” including much of what ORMs generate, often holds up well against hand-written procedural data access: the database’s query optimizer knows things your application usually doesn’t, such as table sizes and available indexes, and it tends to get smarter with each release.

When your data is mostly relationships, use a graph

The third major data model in the chapter is the graph. If documents are great for tree-shaped data and relational is great for tabular data with joins, graphs are great for data where the relationships matter as much as the things being related.

Social networks are the obvious example: who follows whom, who’s friends with whom, who reposted whom’s post. But graphs also fit naturally for road networks, recommendation systems, knowledge bases, fraud detection, and anywhere you find yourself writing recursive SQL queries with five layers of CTEs that no one will ever understand again.

Kleppmann walks through two main flavors. Property graphs (Neo4j, etc.) treat nodes and edges as first-class objects with their own properties, queried with languages like Cypher. Triple-stores (RDF, SPARQL) model everything as (subject, predicate, object) triples, which is conceptually clean and has roots in the Semantic Web movement. Both can express things that would be miserable in SQL — like “find all the people connected to me by at most three hops who work at companies headquartered in cities I’ve visited” — in a few lines.
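For a sense of what the "at most three hops" part involves, here is the traversal written out by hand in Python. This is not Cypher or SPARQL, just a breadth-first search over a made-up adjacency list, and the `knows` graph and `within_hops` helper are hypothetical names. A graph query language lets you state the hop limit and leave the traversal to the engine.

```python
from collections import deque

# Hypothetical "knows" edges, stored as an adjacency list.
knows = {
    "me": ["ana", "ben"],
    "ana": ["cho"],
    "ben": [],
    "cho": ["dev"],
    "dev": ["eli"],
    "eli": [],
}


def within_hops(graph, start, max_hops):
    """Breadth-first search that stops expanding after max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not follow edges past the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]  # report only the other people, not the start node
    return distance
```

Calling `within_hops(knows, "me", 3)` reaches `ana`, `ben`, `cho`, and `dev`, but not `eli`, who is four hops away. Filtering those people by employer and city would be further hand-written steps on top of this.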

The historical note he ends on is great: Datalog, a declarative logic-programming language studied extensively in the 1980s, is much older than Cypher or SPARQL and laid foundations that later query languages build on. Half of computer science is rediscovering ideas from the 1980s with better marketing.

The takeaway

Three data models — relational, document, graph — and they’re not really competitors. They’re tools for different shapes.

Tree-shaped, mostly self-contained data with one-to-many relationships → documents.
Tabular data with lots of many-to-many relationships → relational.
Highly interconnected data where the relationships are the point → graph.

Most real systems eventually use more than one. Your user accounts live in Postgres, your activity feed lives in a document store, your recommendation engine reads from a graph. That’s not a failure of architecture; it’s the architecture working as intended.

The deeper lesson is the one Kleppmann keeps coming back to: data outlives code. The schema you pick today will still be shaping decisions five years from now, long after the framework you used to build the API has been replaced. Choosing the right model early is one of the cheapest performance and maintainability wins in software.

Choose for your data’s shape, not for the conference talk you saw last week.

The Three Words Every Backend Engineer Should Tattoo on Their Forearm

Notes from Chapter 1 of Designing Data-Intensive Applications

If you’ve spent any time around backend engineers, you’ve probably noticed they love to argue. Postgres or MongoDB? Kafka or RabbitMQ? Microservices or “the modulith”? Most of these debates feel like they’re about technology, but they’re almost never really about technology. They’re about tradeoffs — and the tradeoffs only make sense when you know what you’re optimizing for.

Martin Kleppmann opens Designing Data-Intensive Applications by giving us the vocabulary to have those arguments properly. Chapter 1 is called “Reliable, Scalable, and Maintainable Applications,” and those three adjectives are the entire point of the book. Get them right, and the rest of the 500 pages is essentially a tour of how different systems make different bets in service of those three goals.

Here’s what the chapter actually says, and why it’s worth slowing down on before racing into the chapters about replication and consensus.

What even is a “data-intensive” application?

Kleppmann’s framing in the first few pages is small but important: most of the systems we build today are not bottlenecked by raw CPU. They’re bottlenecked by the amount of data, the complexity of it, or the speed at which it changes. A web app that serves a million users isn’t doing hard math. It’s juggling state — reading it, writing it, caching it, indexing it, replicating it, keeping it consistent enough to be useful and inconsistent enough to be fast.

These applications are built out of remarkably standard parts: databases, caches, search indexes, message queues, stream processors, batch processors. The interesting engineering question isn’t usually “which database?” It’s “how do these pieces fit together for this workload?” Two apps with identical tech stacks can have wildly different architectures because they’ve answered that question differently.

That sets up the rest of the chapter. If your job is gluing data systems together, what are you actually trying to achieve?

Reliability: keep working when things go wrong

The first goal is reliability — and Kleppmann gives a definition that sounds obvious but is genuinely useful: a reliable system continues to work correctly even when things go wrong.

The key distinction here is between a fault (one component misbehaving) and a failure (the whole system stopping). The job of a reliable system isn’t to prevent faults — that’s impossible — it’s to prevent faults from cascading into failures. That’s what “fault-tolerant” means.
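One way to picture the distinction is a read that is served by several redundant replicas. The sketch below is hypothetical throughout (the `ReplicaFault` exception, the replica functions, and `read_with_failover` are invented for illustration): a single replica raising an error is a fault, and it only becomes a failure if every replica faults.

```python
class ReplicaFault(Exception):
    """Raised by a single replica that cannot serve a request."""


def read_with_failover(replicas, key):
    """Try each replica in turn; fail only if every one of them faults."""
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaFault:
            continue  # a fault in one component; try the next one
    # Every component faulted, so the system as a whole has failed.
    raise RuntimeError(f"all {len(replicas)} replicas faulted for {key!r}")


def broken_replica(key):
    raise ReplicaFault("disk died")


def healthy_replica(key):
    return f"value-for-{key}"
```

With `[broken_replica, healthy_replica]` the caller never sees the fault. With only `broken_replica`, the same fault surfaces as a failure.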

Faults come in three flavors:

Hardware faults. Disks die, power cuts out, networks flake. We’ve been dealing with these for decades, mostly through redundancy: RAID arrays, dual power supplies, multiple availability zones. This is the easy category, in the sense that the failure modes are well understood.
Software errors. Bugs, runaway processes, cascading failures where one slow service takes down everything that depends on it. These are nastier because they can hit every replica simultaneously — your fancy redundancy won’t save you if all three nodes have the same bug.
Human errors. And here’s the punchline: humans cause more outages than hardware. The defenses are good abstractions, sandboxed environments for testing, telemetry that catches problems early, and — crucially — making it easy to roll back when someone inevitably ships something broken at 4pm on a Friday.

It’s tempting to skip reliability work on “non-critical” applications, but Kleppmann pushes back on that: the cost of losing user trust usually exceeds the cost of building things properly the first time. A photo app isn’t life-or-death, but if it loses your wedding photos once, you’re never opening it again.

Scalability: cope with growth

Scalability is the one everyone thinks they understand and almost no one defines properly. Kleppmann’s framing is that “scalable” isn’t a property a system has or doesn’t have — it’s a question, and the question only makes sense if you specify two things: what you mean by load, and what you mean by performance.

Load is whatever parameter actually pressures your system. For a web server it might be requests per second. For a cache it might be the hit rate. For a database it might be the read/write ratio. The chapter’s famous Twitter example shows why this matters: serving home timelines is a hard problem, but the right solution depends entirely on whether you optimize for the read path (fan-out on write, materialize each user’s timeline) or the write path (fan-in on read, query everyone’s posts when the user opens the app). Twitter actually switched approaches as their workload changed. Same problem, different load characteristics, different architecture.
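The two timeline strategies can be sketched in a few lines of Python. This is a toy model under assumptions, not Twitter's implementation: the follow graph, the class names, and the in-memory storage are all made up to show where the work lands.

```python
from collections import defaultdict

# Hypothetical follow graph, stored in both directions for convenience.
followers = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": []}
following = {"alice": [], "bob": ["alice"], "carol": ["alice", "bob"]}


class QueryOnRead:
    """Cheap writes: store each post once, assemble timelines when asked."""

    def __init__(self):
        self.posts = defaultdict(list)  # author -> [(sequence number, text)]
        self.seq = 0

    def post(self, author, text):
        self.seq += 1
        self.posts[author].append((self.seq, text))

    def timeline(self, user):
        # Work happens here: gather and sort posts from everyone followed.
        merged = [p for author in following[user] for p in self.posts[author]]
        return [text for _, text in sorted(merged, reverse=True)]


class FanOutOnWrite:
    """Cheap reads: push each post into every follower's stored timeline."""

    def __init__(self):
        self.timelines = defaultdict(list)  # user -> [text], newest first

    def post(self, author, text):
        # Work happens here: one insert per follower of the author.
        for follower in followers[author]:
            self.timelines[follower].insert(0, text)

    def timeline(self, user):
        return list(self.timelines[user])
```

Both classes return the same timelines for the same posts. What differs is the cost profile: the first does one write per post and a merge per read, while the second does one write per follower and a plain lookup per read, which gets expensive for authors with very many followers.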

Performance is the other side. And here Kleppmann makes what I’d argue is the single most important point in backend engineering: don’t use averages, use percentiles. A system with a 100ms average response time can still be miserable to use if the slowest 1% of requests take 5 seconds. Tail latencies (p95, p99, p999) are what users actually feel, and they tend to disproportionately hit your most valuable customers: Kleppmann cites Amazon’s observation that the slowest requests often belong to the customers with the most data on their accounts, because they have made the most purchases.
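A quick calculation shows how an average can hide the tail. The latencies below are made up, and the `percentile` helper uses the simple nearest-rank definition, which is one of several conventions; monitoring tools may interpolate differently.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [4000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 99.6
p50_ms = percentile(latencies_ms, 50)            # 20
p99_ms = percentile(latencies_ms, 99)            # 4000
```

The mean of about 100ms describes almost no actual request: the typical one takes 20ms, and the unlucky ones take four seconds.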

Once you know your load and your performance target, scaling becomes a design problem. You can scale up (bigger machine) or out (more machines), and the right answer depends on the workload. There is no universal scalable architecture. Anyone selling you one is selling you something.

Maintainability: make it livable

The last goal is the one engineers love least and pay for most. Most of a system’s lifetime cost is not in writing it; it’s in keeping it running, evolving it, and onboarding new people to it. Maintainability is the property that lets future-you (or the person who replaces you) keep the lights on without losing their mind.

Kleppmann breaks it into three sub-principles:

Operability. Make it easy for the operations team to keep the system healthy. Good monitoring. Predictable behavior under load. Useful logs. Documentation that exists. Default behaviors that make sense. The classic “it works on my machine” failure is an operability failure.
Simplicity. Manage complexity. Every system accumulates accidental complexity over time, and the main weapon against it is good abstractions — taking something messy and giving it a clean interface. The opposite, which Kleppmann calls a “big ball of mud,” is what happens when you skip this and let everything tangle into everything else.
Evolvability. Make it easy to change. Requirements always change. The org changes, the product changes, the regulations change, the scale changes. Systems that can’t evolve get rewritten, and rewrites are expensive and dangerous.

If you’ve ever worked on a codebase where every change feels like defusing a bomb, you’ve experienced the absence of all three at once.

The takeaway

Reliability, scalability, maintainability. Three words, and the entire book is essentially “how do specific tools and techniques affect these three properties for specific workloads?”

What I find genuinely useful about this chapter is that it gives you a way to evaluate technical decisions without falling into religious wars. When someone says “we should use Kafka,” the question isn’t whether Kafka is good. It’s: which of these three properties does Kafka improve, by how much, for our workload, and at what cost to the others? Sometimes the answer is “a lot, cheaply, do it now.” Sometimes the answer is “not really, and it’ll add a fourth on-call rotation.”

There are no silver bullets. There are only tradeoffs against a specific set of requirements — and the first job of any backend engineer is being able to say what those requirements are.

That’s the real lesson of Chapter 1. The rest of the book is just receipts.
