Notes from Chapter 2 of Designing Data-Intensive Applications
There’s a particular kind of argument that breaks out on engineering teams roughly once a quarter: SQL or NoSQL? It’s usually framed as a question about technology, or performance, or scalability, but it almost never actually is. It’s a question about shape. Specifically: what shape is your data, and what shape do you want to query it in?
Chapter 2 of Designing Data-Intensive Applications is called “Data Models and Query Languages,” and Martin Kleppmann uses it to make a case that’s surprisingly hard to internalize: the data model you pick has more impact on how your software gets written — and how it ages — than almost any other decision. Pick the wrong shape and you’ll spend years gluing it back together with application code. Pick the right one and a lot of problems just stop existing.
Here’s the tour.
The relational model won, then NoSQL happened, then everyone calmed down
Kleppmann opens with a quick history lesson, and it’s actually load-bearing. The relational model — tables, rows, columns, joins — won decisively in the 1970s and 80s, beating out the hierarchical model (IBM’s IMS) and the network model (CODASYL) that came before. It won because it was simpler and more flexible, and because SQL gave you a clean separation between what you wanted and how the database should get it.
Then in the late 2000s, “NoSQL” happened. Suddenly everyone was on MongoDB and the relational model was dead. Except it wasn’t — and the chapter is genuinely useful at explaining why the rebellion happened and why it cooled off.
The NoSQL pitch was real: better scalability for certain workloads, more flexible schemas, query patterns that mapped naturally to specific applications, and a way to dodge the object-relational impedance mismatch — the awkward translation layer between objects in your code and rows in your tables that ORMs spend their entire existence patching over.
But the relational model didn’t die. It absorbed. Today’s databases are increasingly hybrid: PostgreSQL has JSON columns, MongoDB has joins. The interesting question stopped being “which side are you on?” and became “which shape fits your data?”
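To make “hybrid” concrete, here’s a sketch of the PostgreSQL flavor: rigid columns where you want guarantees, a JSONB document column where you want flexibility. The table itself is invented for illustration; the `jsonb` type and the `->>` operator are real Postgres features.

```python
# Invented table mixing enforced structure with a free-form document:
HYBRID_TABLE = """
CREATE TABLE users (
    id    bigint PRIMARY KEY,
    email text NOT NULL,   -- the database enforces this on write
    prefs jsonb            -- free-form JSON, validated by whoever reads it
);
"""

# Postgres can still query inside the document column:
DARK_MODE_USERS = "SELECT email FROM users WHERE prefs->>'theme' = 'dark';"
```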
Documents are great until your data has friends
The core argument for document databases is locality. If your data is shaped like a tree — a user with their profile, their posts, their preferences, all naturally nested inside one logical thing — then storing it as a single document means one read fetches everything. No joins. No N+1 queries. The structure of the data on disk matches the structure your code wants.
For one-to-many relationships, this is genuinely lovely. A blog post and its tags, a resume and its work history, an order and its line items — these are tree-shaped, and trees fit nicely in documents.
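To see the shape, here’s a minimal sketch with a plain Python dict standing in for a stored document (all the field names and values are invented for illustration):

```python
# A resume as one self-contained document: a single read returns
# the whole tree, no joins needed.
resume = {
    "user_id": 251,
    "name": "Lottie",
    "positions": [  # one-to-many: these belong to this resume alone
        {"title": "Engineer", "org": "Initech"},
        {"title": "Co-founder", "org": "Hooli"},
    ],
    "education": [
        {"school": "Big State University", "year": 2009},
    ],
}

# The nesting matches what the rendering code wants:
for pos in resume["positions"]:
    print(f'{pos["title"]} at {pos["org"]}')
```

The locality is the whole trick: the data is laid out on disk the way it gets read.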
The problem starts when relationships go many-to-many. Suppose two users went to the same university. In a document model, you either duplicate the university name in both user documents (and now updating it is a nightmare) or you store an ID and look the university up separately (which is a join, just one your application has to do by hand). Either way, you’ve reinvented something the relational model gave you for free.
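A sketch of the two options, with invented names:

```python
# Option 1: duplicate the university into every user document.
# Reads are a single fetch, but renaming the university means
# finding and rewriting every copy of it.
alice = {"name": "Alice", "university": {"name": "Big State U", "city": "Springfield"}}
bob   = {"name": "Bob",   "university": {"name": "Big State U", "city": "Springfield"}}

# Option 2: store an ID and resolve it yourself. No duplication,
# but every read is now a join done by hand in application code.
universities = {"u1": {"name": "Big State U", "city": "Springfield"}}
alice = {"name": "Alice", "university_id": "u1"}
bob   = {"name": "Bob",   "university_id": "u1"}

def with_university(user):
    # the lookup a relational database would have done for us
    return {**user, "university": universities[user["university_id"]]}
```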
Kleppmann’s point isn’t that one model is better. It’s that the more interconnected your data is, the worse documents fit and the better relational fits. If your domain is mostly self-contained units, documents win. If everything points to everything else, you want joins.
Schema-on-read is not the same as no schema
One of the most useful clarifications in the chapter is about schemas. Document databases are often called “schemaless,” and Kleppmann pushes back on this hard. There’s always a schema — the question is just where it lives.
Relational databases use schema-on-write: the database enforces the structure when you insert data. Document databases typically use schema-on-read: the structure is implicit, enforced (or assumed) by whatever code reads the data later.
Neither is automatically better. Schema-on-read is genuinely useful when your data is heterogeneous, when the structure comes from external sources you don’t control, or when you’re moving fast and the schema is changing constantly. Schema-on-write is genuinely useful when you want guarantees, when many different applications read the same data, or when you’ve been burned one too many times by a field that was supposed to always be a string and turned out to sometimes be null, sometimes a number, and sometimes the literal string "undefined".
The flexibility of schema-on-read isn’t free. It’s a debt you pay later, in defensive code and migration scripts and 3am incidents.
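Here’s roughly what that debt looks like when it comes due: a sketch of a reader defending itself against every shape a field has ever had (the field and its history are invented):

```python
def read_age(doc: dict) -> int | None:
    # Schema-on-read: the schema lives here, in the reader, not in
    # the database, and every shape the field has ever had must be
    # handled forever.
    age = doc.get("age")
    if isinstance(age, int):
        return age
    if isinstance(age, str):
        if age == "undefined":  # yes, the literal string
            return None
        try:
            return int(age)
        except ValueError:
            return None
    return None  # missing, null, or some shape we haven't met yet
```

Schema-on-write would have said `age integer` once, in the table definition, and been done with it.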
Declarative beats imperative, almost always
The middle of the chapter is about query languages, and the central claim is one I think more programmers should sit with: declarative is better than imperative, and SQL is one of the great triumphs of declarative thinking.
In an imperative query, you tell the database how to find your data — loop through this, filter that, sort the other thing. In a declarative query, you describe what you want and let the database figure out how. SQL is declarative. MapReduce, Kleppmann notes, sits somewhere in between: snippets of imperative code run inside a framework that handles the rest. And CSS, surprisingly, is declarative too: you describe what the styled page should look like, and the browser figures out how to render it.
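The chapter’s own example is the sharks query. Transliterated into Python for the imperative half (the `animals` data is assumed to be a list of dicts), the contrast looks like this:

```python
# Imperative: we dictate the how. The full scan, the filter, and
# the order of operations are all frozen in code, so the database
# can't swap in a better plan.
def get_sharks(animals):
    sharks = []
    for animal in animals:  # always a full scan
        if animal["family"] == "Sharks":
            sharks.append(animal)
    return sharks

# Declarative: we state the what. The query planner is free to
# use an index, parallelize, or reorder as it sees fit.
SHARKS_SQL = "SELECT * FROM animals WHERE family = 'Sharks';"
```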
The advantage of declarative isn’t just elegance. It’s that the database is free to optimize. It can add an index, parallelize the query, change the join order, switch algorithms based on table sizes — all without you rewriting anything. Imperative code locks in the how, which means it can’t get faster while you sleep.
This is why ORMs that generate “good enough” SQL beat hand-tuned procedural code in most real systems: the database has more information than your application does about how to actually run the query, and it’s getting smarter every release.
When your data is mostly relationships, use a graph
The third major data model in the chapter is the graph. If documents are great for tree-shaped data and relational is great for tabular data with joins, graphs are great for data where the relationships matter as much as the things being related.
Social networks are the obvious example: who follows whom, who’s friends with whom, who reposted whose post. But graphs are also a natural fit for road networks, recommendation systems, knowledge bases, fraud detection, and anywhere you find yourself writing recursive SQL queries with five layers of CTEs that no one will ever understand again.
Kleppmann walks through two main flavors. Property graphs (Neo4j, etc.) treat nodes and edges as first-class objects with their own properties, queried with languages like Cypher. Triple-stores (RDF, SPARQL) model everything as (subject, predicate, object) triples, which is conceptually clean and has roots in the Semantic Web movement. Both can express things that would be miserable in SQL — like “find all the people connected to me by at most three hops who work at companies headquartered in cities I’ve visited” — in a few lines.
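To see why triples are conceptually clean, here’s a toy sketch: triples as plain Python tuples, plus the kind of follow-the-edges traversal that graph query languages express in a line and SQL makes miserable. The data is a simplified version of the chapter’s example; the helper names are invented:

```python
# (subject, predicate, object): everything is a node or an edge.
triples = [
    ("lucy",     "born_in", "idaho"),
    ("idaho",    "within",  "usa"),
    ("alain",    "born_in", "beaune"),
    ("beaune",   "within",  "burgundy"),
    ("burgundy", "within",  "france"),
]

def objects(subject, predicate):
    return [o for s, p, o in triples if s == subject and p == predicate]

def birth_country(person):
    # follow born_in once, then keep following 'within' edges to the
    # top: a variable-length traversal, trivial here, painful in SQL
    places = objects(person, "born_in")
    frontier = places
    while frontier:
        places, frontier = frontier, [o for s in frontier for o in objects(s, "within")]
    return places

print(birth_country("alain"))  # ['france']
```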
The historical note he ends on is great: most of these graph query languages descend from Datalog, a declarative logic programming language from the 1980s. Half of computer science is rediscovering ideas from the 1980s with better marketing.
The takeaway
Three data models — relational, document, graph — and they’re not really competitors. They’re tools for different shapes.
- Tree-shaped, mostly self-contained data with one-to-many relationships → documents.
- Tabular data with lots of many-to-many relationships → relational.
- Highly interconnected data where the relationships are the point → graph.
Most real systems eventually use more than one. Your user accounts live in Postgres, your activity feed lives in a document store, your recommendation engine reads from a graph. That’s not a failure of architecture; it’s the architecture working as intended.
The deeper lesson is the one Kleppmann keeps coming back to: data outlives code. The schema you pick today will still be shaping decisions five years from now, long after the framework you used to build the API has been replaced. Choosing the right model early is one of the cheapest performance and maintainability wins in software.
Choose for your data’s shape, not for the conference talk you saw last week.