Data & System Architecture, from the ground up Lesson 19 / 80

Document stores: MongoDB and the rise/fall/rebirth

When nested data is the model, what schema-on-read costs, and the operational lessons MongoDB taught the industry.

The previous two lessons covered the relational model (lesson 17) and the key-value model (lesson 18). This one covers the third major family: document stores, where each record is a JSON-shaped tree and the schema is, at least nominally, optional. The canonical product is MongoDB. Its history is a useful arc because it runs through the full cycle: new-technology hype, painful operational reality, then steady maturation into a legitimate tool. Reading the arc is the fastest way to understand both what document stores are good for and where the discourse around them went wrong.

The data model

A document store treats each record as a self-contained document, usually JSON or a binary JSON variant (BSON in MongoDB’s case). A document is a tree of named fields whose values can be primitives (strings, numbers, booleans, null, dates), arrays, or nested documents. Documents live in collections, the document-store equivalent of tables. There is no fixed schema declared at the collection level; two documents in the same collection can have different fields, different types for the same field name, or arbitrarily different shapes.

A typical document might look like this, conceptually:

{
  "_id": "post_42",
  "title": "Hello world",
  "author": { "id": "user_7", "name": "Narcis" },
  "tags": ["intro", "meta"],
  "comments": [
    { "by": "user_3", "text": "nice", "at": "2026-04-12T10:30:00Z" },
    { "by": "user_5", "text": "+1",  "at": "2026-04-12T11:02:00Z" }
  ],
  "published_at": "2026-04-12T09:00:00Z"
}

The whole post-with-comments-and-author lives in a single document. A read fetches the whole thing in one go. A write updates fields inside it, and in MongoDB a write to a single document is atomic. The pitch was: this is closer to how the application thinks about the data, so the impedance mismatch between the database and the object model goes away.

Compared to the relational equivalent, where you would have a posts table, a users table, a comments table, and a tags table, the document model collapses what would have been three or four joins into a single document fetch. For the access pattern “show me a post with everything attached,” that is a real win.
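The contrast can be made concrete with a minimal sketch. It uses a plain Python dict as a stand-in for a document collection and an in-memory SQLite database as a stand-in for the relational side; the table layout (including a post_tags join table) and the function names are assumptions made for illustration, not anything a particular product prescribes.

```python
import sqlite3

# The post from the example above, stored as one document (a plain dict here).
documents = {
    "post_42": {
        "_id": "post_42",
        "title": "Hello world",
        "author": {"id": "user_7", "name": "Narcis"},
        "tags": ["intro", "meta"],
        "comments": [
            {"by": "user_3", "text": "nice"},
            {"by": "user_5", "text": "+1"},
        ],
    }
}


def fetch_document(post_id):
    """Document model: one lookup returns the whole tree."""
    return documents[post_id]


def build_relational_db():
    """Relational model: the same data split across several tables."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id TEXT PRIMARY KEY, title TEXT,
                            author_id TEXT REFERENCES users(id));
        CREATE TABLE comments (post_id TEXT REFERENCES posts(id),
                               by_user TEXT, text TEXT);
        CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE post_tags (post_id TEXT REFERENCES posts(id),
                                tag_id INTEGER REFERENCES tags(id));
        INSERT INTO users VALUES ('user_7', 'Narcis');
        INSERT INTO posts VALUES ('post_42', 'Hello world', 'user_7');
        INSERT INTO comments VALUES ('post_42', 'user_3', 'nice'),
                                    ('post_42', 'user_5', '+1');
        INSERT INTO tags VALUES (1, 'intro'), (2, 'meta');
        INSERT INTO post_tags VALUES ('post_42', 1), ('post_42', 2);
        """
    )
    return db


def fetch_relational(db, post_id):
    """Reassemble the same tree with a join plus two follow-up queries."""
    title, author_id, author_name = db.execute(
        "SELECT p.title, u.id, u.name FROM posts p "
        "JOIN users u ON u.id = p.author_id WHERE p.id = ?",
        (post_id,),
    ).fetchone()
    tags = [row[0] for row in db.execute(
        "SELECT t.name FROM post_tags pt JOIN tags t ON t.id = pt.tag_id "
        "WHERE pt.post_id = ? ORDER BY t.id",
        (post_id,),
    )]
    comments = [{"by": by, "text": text} for by, text in db.execute(
        "SELECT by_user, text FROM comments WHERE post_id = ? ORDER BY rowid",
        (post_id,),
    )]
    return {
        "_id": post_id,
        "title": title,
        "author": {"id": author_id, "name": author_name},
        "tags": tags,
        "comments": comments,
    }
```

Both functions return the same tree. The difference is where the assembly work happens: in the document case it was done once at write time, in the relational case it is redone on every read.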

The pitch, circa 2010

MongoDB launched in 2009. The discourse from 2010 to 2013 was relentlessly enthusiastic and the marketing was specific: “schemaless databases are easier”; “JSON in, JSON out, nothing to migrate”; “agile teams ship faster without the friction of schema definitions.” The pitch drove a very large number of MongoDB deployments, in many cases at teams that did not actually need the document model.

The intellectual move underneath was real even if the execution was overstated. SQL schemas in the early-2010s Rails-and-Django world were genuinely painful: every new feature meant a migration, every migration meant a deploy choreography, ORMs papered over the impedance mismatch and introduced their own bugs. “Just store the JSON” felt liberating. For the prototype phase of a startup, where the product is changing weekly and nobody knows the right schema, the no-schema pitch was attractive.

The pitch had two specific blind spots. First, the absence of a declared schema does not mean the absence of a schema. The schema is implicit, scattered across application code, and unenforced by any single component. Two services write the same collection with subtly different shapes. A field is renamed in one place but not another. A read in a third service crashes on an unexpected type. The schema exists; it is just not where you can see it.
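A toy illustration of that first blind spot, with a Python list standing in for a collection. The three services and their field choices are hypothetical; the point is only that the mismatch surfaces at read time, not at write time.

```python
# Two hypothetical services write to the same "users" collection (a list here).
collection = []


def signup_service_write(user_id, age):
    # Service A stores age as an integer.
    collection.append({"_id": user_id, "age": age})


def import_service_write(user_id, age_text):
    # Service B, written later, stores age as a string under the same name.
    collection.append({"_id": user_id, "age": age_text})


def average_age():
    # Service C assumes every age is a number; nothing enforces that.
    return sum(doc["age"] for doc in collection) / len(collection)


signup_service_write("user_1", 30)
import_service_write("user_2", "41")

try:
    average_age()
    reader_failed = False
except TypeError:
    # int + str fails only at read time, far from the write that caused it.
    reader_failed = True
```

Both writes succeed without complaint. The failure appears in a third piece of code that did nothing wrong except trust the implicit schema.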

Second, “easier to evolve” was a half-truth. Yes, you can change the shape without a migration. But the old documents, written under the old shape, are still in the database. The application code has to handle every historical shape forever, or you write a migration anyway, but now without the database’s help.
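One common way applications cope with historical shapes is to upgrade documents as they are read. The sketch below assumes a hypothetical schema_version field and two invented shape changes; it is one possible pattern, not a built-in database feature.

```python
def upgrade(doc):
    """Return a copy of doc in the newest shape (hypothetical version 3)."""
    doc = dict(doc)
    version = doc.get("schema_version", 1)
    if version == 1:
        # v1 stored the author as a bare name string.
        doc["author"] = {"name": doc["author"]}
        version = 2
    if version == 2:
        # v2 had no tags field; v3 readers expect a list.
        doc.setdefault("tags", [])
        version = 3
    doc["schema_version"] = version
    return doc


old_doc = {"title": "Hello world", "author": "Narcis"}
new_doc = upgrade(old_doc)
```

Every shape change adds another branch, and no branch can be deleted until a backfill has rewritten every old document. That is the migration the no-schema pitch claimed to avoid, moved into application code.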

The reality, circa 2015 to 2020

The next half decade was a brutal education for the document-store world, and MongoDB took the brunt of it. Two categories of pain.

The correctness pain came first. Early MongoDB defaults were, charitably, unsafe: writes returned success before being durably persisted, replicas could fall behind without warnings, isolation guarantees were weaker than users assumed. Kyle Kingsbury’s Jepsen tests between 2013 and 2020 found a sequence of issues: lost updates under network partition, stale reads from secondaries, transactions that did not actually behave transactionally. MongoDB Inc. responded, fixed the issues, and shipped better defaults, but the reputational damage took years to repair. “MongoDB is web scale” circulated through the engineering community for most of the 2010s, delivered with a sneer.
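The acknowledgement problem can be shown with a deliberately simplified model. It treats a replica set as a list of names and ignores journaling, elections, and rollback entirely; the function names are invented for this sketch.

```python
def majority(replica_count):
    """Smallest number of members that forms a majority of the set."""
    return replica_count // 2 + 1


def surviving_copies(acked_by, failed):
    """Members that acknowledged the write and are still alive."""
    return set(acked_by) - set(failed)


replicas = ["a", "b", "c"]

# Early-style acknowledgement: success reported once the primary "a" has it.
early_ack = ["a"]
# Majority acknowledgement: success reported once two of three have it.
majority_ack = ["a", "b"]

# The primary dies right after acknowledging.
lost_with_early_ack = not surviving_copies(early_ack, failed=["a"])
lost_with_majority_ack = not surviving_copies(majority_ack, failed=["a"])
```

In the first case the client was told the write succeeded and no surviving member has it. In the second, at least one survivor holds the write, which is the property a majority acknowledgement is meant to buy.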

The operational pain came alongside. Sharding worked badly in practice for years. Cluster operations (rebalancing, backup, point-in-time recovery, schema changes) were either harder than promised or not actually present. Teams that picked MongoDB on the strength of the developer-experience pitch discovered, six months in, that they were operating a distributed system without the operational maturity Postgres had built up over decades.

The result was a slow, public retreat. The “document store as primary database” narrative faded, and by 2018 the industry consensus was: use document stores only for specific cases where the shape is genuinely right.

The rebirth, circa 2020 to today

If you stopped reading the MongoDB story in 2018 you would have missed the rebound. From the late 2010s into the early 2020s, MongoDB Inc. did the unglamorous work of fixing the actual problems, and by 2026 the product is meaningfully different from the one Jepsen tested in 2013.

The main changes:

  • Multi-document transactions arrived in MongoDB 4.0 (2018) with proper ACID within a replica set, then extended to sharded clusters in 4.2. The “documents are atomic so you don’t need transactions” defense was retired.
  • Write and read concerns were tightened: since MongoDB 5.0 the default write concern is majority, so a write is acknowledged only after a majority of the replica set has it.
  • Schema validation was added: you can declare a JSON Schema validator on a collection, and the database enforces it. The “schemaless” pitch was quietly replaced by “schema-flexible,” which is the honest version.
  • Aggregation pipelines matured into a serious query language with $lookup joins, window functions, and faceted search.
  • Cluster operations improved across the board.
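To show what collection-level schema validation looks like, here is a small validator in MongoDB's $jsonSchema form, together with a toy checker. The checker is an assumption-laden stand-in written for this lesson: it handles only required and three bsonType values, and it is not how the server evaluates validators.

```python
# A validator in the shape MongoDB accepts for a collection's "validator" option.
post_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["title", "author"],
        "properties": {
            "title": {"bsonType": "string"},
            "author": {"bsonType": "object"},
            "tags": {"bsonType": "array"},
        },
    }
}

# Hypothetical mapping from a few BSON type names to Python types.
PYTHON_TYPES = {"string": str, "object": dict, "array": list}


def violations(doc, validator):
    """List the ways doc breaks the 'required' and 'bsonType' rules."""
    schema = validator["$jsonSchema"]
    problems = []
    for field in schema.get("required", []):
        if field not in doc:
            problems.append(f"missing required field: {field}")
    for field, rule in schema.get("properties", {}).items():
        expected = PYTHON_TYPES[rule["bsonType"]]
        if field in doc and not isinstance(doc[field], expected):
            problems.append(f"wrong type for field: {field}")
    return problems


good = {"title": "Hello world", "author": {"name": "Narcis"}, "tags": ["intro"]}
bad = {"title": 42}
```

With a real server the validator document is attached to the collection and the database rejects non-conforming writes itself, which puts the schema back in one visible, enforced place.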

The Jepsen record tracks the change, though not as a clean pass. The 2020 analysis, run against MongoDB 4.2.6, still reported transaction anomalies and criticized weak default read and write concerns; the defaults were tightened in later releases. Multi-document transactions are now a supported, documented feature, with the caveat that long-running transactions across a sharded cluster have real performance costs. The product is now a legitimate choice for the cases where the document model fits, and the discourse is, finally, calibrated.

When document stores genuinely fit

There is a small, well-defined set of workload traits for which the document model is the right tool:

  • Each entity is a self-contained tree: a blog post with embedded comments and tags, a product with nested variants and images, a configuration document, a complete order with line items and shipping.
  • The access pattern is “fetch the whole thing, modify part of it, save it back,” and the relational decomposition would split this across four or five tables for no clear benefit at read time.
  • You do not need cross-entity transactions often. The atomicity boundary in a document store is, by default, the document, so single-document atomicity comes for free; if your transactions routinely span many documents, the document model is not saving you anything.
  • The schema evolves as the product matures. Early-stage products benefit from adding sparse fields without a migration, though Postgres with JSONB can also handle this.
  • Documents in the same collection are heterogeneous: different products with different attributes, different events with different payloads, where fixed columns are awkward and optional fields are natural.

If your workload has one or two of these traits, a document store is a reasonable choice. If it has none, you are paying the cost of an unfamiliar tool for benefits you are not using.

When SQL with JSONB is a better choice

Postgres has had a JSON column type since 2012, and JSONB (binary, indexed, query-friendly) since 2014. By 2026 the feature set is mature: GIN indexes for arbitrary inner-key lookups, expression indexes for specific paths, a rich operator set, JSONPath support, good query-planner integration. The combination of a relational schema for the structured part plus JSONB columns for the flexible part covers most of the use cases the document model was sold for, without giving up the rest of SQL.

The Postgres-with-JSONB pattern wins when you sometimes need relational queries and sometimes document-style (mixing both in MongoDB is awkward; in Postgres it is one query), when you want one engine instead of two (separate backups, monitoring, failover, and expertise are a real cost), or when most of your data is structured but a small part is genuinely flexible.
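The hybrid layout can be sketched in a few lines. SQLite stands in for Postgres here so the example runs without a server, and the products table, its columns, and the function name are all invented for illustration.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
# Structured part as ordinary columns; flexible part as one JSON column.
# In Postgres the attrs column would be declared JSONB rather than TEXT.
db.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, "
    "price_cents INTEGER, attrs TEXT)"
)
rows = [
    (1, "T-shirt", 1500, {"size": "M", "color": "blue"}),
    (2, "Laptop", 90000, {"ram_gb": 16}),
    (3, "Mug", 800, {"color": "blue"}),
]
db.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [(i, n, p, json.dumps(a)) for i, n, p, a in rows],
)


def cheap_products_with_color(max_price_cents, color):
    """Relational filter on price in SQL, document-style filter on attrs."""
    found = []
    for name, attrs_text in db.execute(
        "SELECT name, attrs FROM products WHERE price_cents <= ? ORDER BY id",
        (max_price_cents,),
    ):
        attrs = json.loads(attrs_text)
        if attrs.get("color") == color:
            found.append(name)
    return found
```

In this stand-in the inner-key filter runs in Python. With a JSONB column in Postgres the same condition can move into the WHERE clause (for example with the ->> operator) and be backed by an index, which is what makes the mixed query a single statement.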

The pure document-store path wins when the whole data shape is genuinely flexible, when access patterns are document-shaped end to end with very few cross-document operations, or when you are greenfield with the operational expertise and a team comfortable with the ecosystem.

The honest diagnostic for 2026: if you are not sure, default to Postgres with JSONB. The escape hatch (move the JSON parts to a real document store later) stays open. The reverse migration (MongoDB back to Postgres) is harder, because the data has been written under a no-fixed-schema assumption that may have produced inconsistencies you did not know were there.

flowchart LR
    subgraph Document_Model
        D[blog_post document]
        D --> Dt[title]
        D --> Da[author embedded]
        D --> Dc[comments array embedded]
        D --> Dg[tags array embedded]
    end
    subgraph Relational_Model
        P[posts row]
        P -.fk.-> U[users row]
        P -.fk.-> C[comments rows]
        P -.fk.-> T[post_tags rows]
        T -.fk.-> Tg[tags rows]
    end

The same conceptual entity, two different models. Neither is universally right. The document model is one fetch, one write, one consistency boundary. The relational model is more rows, more joins, more flexibility for queries the original designer did not foresee.

The other document stores

MongoDB is the canonical document store but not the only one. Couchbase has a SQL-like query language (N1QL) and is strong at high-throughput single-document workloads. RavenDB is a .NET-leaning document store with a strong ACID focus. Firestore (Google Cloud) is a managed document store with realtime sync, a great fit for “small documents, lots of clients, realtime updates” mobile and web workloads. DocumentDB (AWS) speaks the MongoDB wire protocol but is AWS’s own engine underneath; compatibility is partial, read the fine print. CouchDB is the older Apache project that pioneered some of the patterns, still in use with PouchDB for offline-first web applications.

The trade-off is similar across all of them: gain natural document storage, lose easy ad-hoc cross-entity queries.

The operational lesson

The most important lesson from the MongoDB arc is not about the document model itself. It is about the relative weight of data model versus operational maturity. MongoDB’s 2010s struggles were not because the document model was wrong; they were because the engine had not yet built up the replication, sharding, transaction, and failure-handling stories that the relational world had spent forty years getting right. The MongoDB of 2026 is good not because the document model became more right, but because the operational story finally caught up.

The corollary generalizes: when you are picking a database, the data model matters less than people think, and the operational maturity matters more. A perfect data-model fit on an immature engine produces an outage. A mediocre fit on a battle-tested engine produces friction in application code. The friction is cheaper.

Where this lesson lands

Document stores are legitimate, with a narrower fit than the 2010s discourse suggested. MongoDB is the canonical example, and the product is meaningfully better than its reputation from a decade ago. Postgres with JSONB covers most of the use cases the document model was sold for, while keeping SQL on hand for the queries the document model is bad at. The honest default in 2026 is still Postgres; the document store is reached for when the workload is genuinely document-shaped end to end.

The next major data model is wide-column (Cassandra, ScyllaDB, HBase, Bigtable), which we cover in lesson 20. After that, time-series (lesson 21), search (lesson 22), graph (lesson 23), and finally the synthesis: lesson 24 on polyglot persistence, the realistic architecture for production systems, which is to use the right specialist store for each shape of data, with a small set of well-defined boundaries between them.

Citations and further reading

  • The MongoDB documentation, https://www.mongodb.com/docs/ (retrieved 2026-05-01). Especially the “Transactions”, “Schema Validation”, and “Replication” sections.
  • Kyle Kingsbury, “Jepsen: MongoDB” reports across the years, https://jepsen.io/analyses (retrieved 2026-05-01). The 2013, 2015, 2017, and 2020 reports, read in order, are an education in how a distributed database matures. Read the most recent one for the current state, but the older ones are the more interesting story.
  • Werner Vogels, “Eventually Consistent”, Communications of the ACM, January 2009. Useful background on the consistency models that shaped the early NoSQL movement, MongoDB included.
  • Alex Petrov, “Database Internals” (O’Reilly, 2019). The chapter on storage engines covers the data-structure tradeoffs underneath document stores, key-value stores, and relational databases in a uniform vocabulary.
  • The PostgreSQL JSONB documentation, https://www.postgresql.org/docs/current/datatype-json.html (retrieved 2026-05-01). The reference for the SQL-with-JSONB option, including indexing, operators, and JSONPath.