Three case studies of 'we should have started simpler'

The previous lesson laid out the symptoms that tell you the single-server architecture is no longer enough. The implicit message was reassuring: the architecture you built in lessons 1 to 6 is fine until it isn’t, and we will spend the rest of the course teaching you how to evolve it.

This lesson is the other half of the message. Most teams over-engineer their first architecture. They reach for microservices, distributed databases, Kubernetes, service meshes, event sourcing, and CQRS, all before they have the load to justify any of it. The result is years of paying for capabilities they do not use, searching for talent they cannot hire, and fighting complexity they cannot debug at three in the morning.

Three companies are useful here because they all resisted the trend in public, with detailed engineering write-ups, and at scales much larger than yours. Stripe stayed on Postgres far longer than the industry expected. Shopify ran a Rails monolith while its merchants’ sales grew into the billions of dollars. Basecamp wrote a manifesto about it.

If your first reaction is “well, those are special cases,” the rest of this lesson is for you. Most architectures are special cases. The question is whether yours is special in a way that justifies the complexity, or whether you are, like most teams, copying a pattern that was built to solve a problem you do not have.

Case 1: Stripe and the long Postgres bet

Stripe processes payments for a meaningful fraction of the internet’s commerce. It is a real-time, money-moving, latency-sensitive, audit-heavy, financially regulated workload. By the conventional wisdom of the 2015 to 2020 era, that is exactly the kind of workload you are supposed to put on a custom-built distributed data store. NoSQL, sharded, eventually consistent, with a homegrown coordination layer.

That is not what Stripe did. Stripe ran, and at the time of writing in 2026 still largely runs, the bulk of its production transactional data on a small set of carefully tended Postgres instances. Sharded, yes, with their own sharding layer on top. Heavily tuned. Replicated across availability zones. But still Postgres. Still SQL. Still ACID. Still a relational database that a database administrator from 2005 would recognize.
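
To make "a sharding layer on top" concrete, here is a minimal sketch of application-level shard routing in Python. It is an illustration under assumptions, not Stripe's design: the shard map, the host names, and the `shard_for` and `dsn_for` helpers are all hypothetical, and a real layer would also have to handle resharding, replication, and cross-shard queries.

```python
import hashlib

# Hypothetical shard map: logical shard number -> connection string (DSN).
# In a real system each entry would point at a separate Postgres primary.
SHARD_DSNS = {
    0: "postgresql://db-shard-0.internal/app",
    1: "postgresql://db-shard-1.internal/app",
    2: "postgresql://db-shard-2.internal/app",
    3: "postgresql://db-shard-3.internal/app",
}
NUM_SHARDS = len(SHARD_DSNS)


def shard_for(account_id: str) -> int:
    """Map an account ID to a shard number with a stable hash.

    hashlib is used instead of the built-in hash() because hash() is
    randomized per process for strings, which would route the same
    account to a different shard after a restart.
    """
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def dsn_for(account_id: str) -> str:
    """Return the connection string for the shard that owns this account."""
    return SHARD_DSNS[shard_for(account_id)]


if __name__ == "__main__":
    for account in ("acct_1001", "acct_1002", "acct_1003"):
        print(account, "->", dsn_for(account))
```

Everything behind the routing function is still an ordinary relational database speaking ordinary SQL. The application only decides which instance to talk to.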

Two engineering posts are worth reading in full to understand the bet. The first is “Online Migrations at Scale” (https://stripe.com/blog/online-migrations), which describes the patterns Stripe uses to evolve a Postgres schema while serving billions of dollars of traffic. It is the practical companion to symptom 6 from the previous lesson. The second is the “Ringpop, sharding, and database operations” series of talks the Stripe data team gave at conferences between 2018 and 2020, which describe the sharding layer they built on top of Postgres rather than under it.
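
The general shape of an online migration of this kind, often called dual writing, can be sketched in a few lines. This is a minimal illustration under assumptions, not code from the post: the `SubscriptionStore` class, the `Phase` names, and the in-memory dictionaries standing in for an old and a new table are all hypothetical.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    OLD_ONLY = 1    # before the migration starts
    DUAL_WRITE = 2  # write to both stores, still read from the old one
    READ_NEW = 3    # write to both stores, read from the new one
    NEW_ONLY = 4    # old store is no longer written and can be dropped later


class SubscriptionStore:
    """Hypothetical data-access layer that hides the migration from callers."""

    def __init__(self) -> None:
        self.old_table: dict = {}
        self.new_table: dict = {}
        self.phase = Phase.OLD_ONLY

    def save(self, sub_id: str, record: dict) -> None:
        # Keep writing to the old store until the final phase.
        if self.phase != Phase.NEW_ONLY:
            self.old_table[sub_id] = dict(record)
        # Start writing to the new store as soon as dual writing begins.
        if self.phase != Phase.OLD_ONLY:
            self.new_table[sub_id] = dict(record)

    def load(self, sub_id: str) -> Optional[dict]:
        if self.phase in (Phase.READ_NEW, Phase.NEW_ONLY):
            return self.new_table.get(sub_id)
        return self.old_table.get(sub_id)

    def backfill(self) -> None:
        """Copy rows that predate dual writing, never overwriting newer ones."""
        for sub_id, record in self.old_table.items():
            self.new_table.setdefault(sub_id, dict(record))


if __name__ == "__main__":
    store = SubscriptionStore()
    store.save("sub_1", {"plan": "basic"})  # written before the migration

    store.phase = Phase.DUAL_WRITE
    store.save("sub_2", {"plan": "pro"})    # lands in both tables
    store.backfill()                        # brings sub_1 across

    store.phase = Phase.READ_NEW
    print(store.load("sub_1"), store.load("sub_2"))

    store.phase = Phase.NEW_ONLY
    store.save("sub_3", {"plan": "team"})
    print("sub_3" in store.old_table)
```

Each phase can be rolled out and rolled back on its own, which is what lets the schema change while traffic keeps flowing.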

The lessons that fall out of the Stripe story:

Postgres at the right size, with the right care, handles enormous load. The ceiling is much higher than most teams assume because most teams have never seen a properly tuned Postgres instance with adequate hardware, sensible queries, good indexes, and a competent operator. They have only seen the version that is too small for the job.
The cost of moving off the relational model is huge. You give up SQL, transactions, joins, foreign keys, the rich ecosystem of tools, the cumulative wisdom of forty years of database administration, and the ability to hire engineers who already know how it works. In exchange you get write throughput at the cost of consistency. For most workloads, that trade is bad.
The cost of investing in vertical scaling and good database operations is small in comparison. A senior database engineer is expensive but not as expensive as the team you would need to operate a custom distributed store.
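
The difference that "good indexes" make is easy to see on a toy table. The sketch below uses Python's built-in sqlite3 module as a stand-in because it needs no server; Postgres exposes the same idea through its own EXPLAIN command. The table, column, and index names are made up for illustration, and the exact wording of the plan varies between SQLite versions.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple) -> str:
    """Return SQLite's plan for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each row holds the human-readable plan step.
    return "; ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, total_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total_cents) VALUES (?, ?)",
    [(i % 1000, i * 10) for i in range(50_000)],
)

sql = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index the only option is to read every row.
plan_before = query_plan(conn, sql, (42,))

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index the planner can jump straight to the matching rows.
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

Reading the plan before and after a change like this is the kind of routine, unglamorous work the list above is describing.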

The Stripe move that is hardest to copy is the discipline. They kept the system simple even when it was fashionable to make it complicated, and they invested in the boring, unsexy work of making Postgres work at their scale. The result is a payments platform that, in terms of architectural complexity, looks more like a 2010 startup than a 2020 unicorn.

Case 2: Shopify and the modular monolith

Shopify is an e-commerce platform. It hosts millions of merchants, processes hundreds of billions of dollars of gross merchandise volume per year, and absorbs a Black Friday weekend whose traffic exceeds the annual peak of most other websites. It is, by any measure, a large-scale distributed system.

It is also, structurally, a Rails monolith. Or rather, it was for most of its life, and the move away from “monolith” toward “modular monolith” has been a deliberate, slow, public, and educational journey. The canonical reference is the Shopify Engineering post “Deconstructing the Monolith: Designing Software That Maximizes Developer Productivity” (https://shopify.engineering/deconstructing-monolith-designing-software-maximizes-developer-productivity).

The Shopify argument is roughly this: the problem with a monolith is not that the code lives in one repository or runs in one process. The problem is when the code lacks internal boundaries. A “big ball of mud” monolith, where any module can call any other module, where domain concerns leak across files, where a change in billing requires a change in shipping, is genuinely hard to scale, both technically and organizationally. But that is not a property of monoliths. It is a property of bad code.

A modular monolith keeps the deployment simple (one app, one process group, one deploy pipeline) while imposing internal boundaries that look a lot like service boundaries. Shopify uses Ruby gems internally, with explicit public APIs, and treats inter-module communication as if it were a network call even when it is just a method invocation. The result is that the modules can be reasoned about independently, evolved independently, and eventually, if necessary, extracted into separate services. But until that necessity hits, the team enjoys the operational simplicity of one deployment.
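
The idea of a boundary that behaves like a service boundary inside one process can be sketched as follows. This is not Shopify's code (their codebase is Ruby); it is a minimal Python illustration, and the `BillingAPI`, `InvoiceSummary`, `billing_api`, and `ShippingService` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


# --- billing component: public API ---

@dataclass(frozen=True)
class InvoiceSummary:
    """Plain data that crosses the boundary, as a response body would."""
    order_id: str
    paid: bool


class BillingAPI(Protocol):
    """The contract other components are allowed to depend on."""

    def invoice_for(self, order_id: str) -> InvoiceSummary: ...


class _LedgerBilling:
    """Private implementation; other components never import this class."""

    def __init__(self) -> None:
        self._paid_orders = {"order_1"}

    def invoice_for(self, order_id: str) -> InvoiceSummary:
        return InvoiceSummary(order_id=order_id, paid=order_id in self._paid_orders)


def billing_api() -> BillingAPI:
    """The only entry point the billing component exposes."""
    return _LedgerBilling()


# --- shipping component: depends on the contract, not the implementation ---

class ShippingService:
    def __init__(self, billing: BillingAPI) -> None:
        self._billing = billing

    def can_ship(self, order_id: str) -> bool:
        return self._billing.invoice_for(order_id).paid


if __name__ == "__main__":
    shipping = ShippingService(billing_api())
    print(shipping.can_ship("order_1"), shipping.can_ship("order_2"))
```

If billing ever has to become a separate service, only the object behind `BillingAPI` changes, to one that makes a network call. `ShippingService` stays as written.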

The lessons:

The monolith is a deployment choice, not a code-organization sin. You can have a monolithic deployment with a clean modular structure, or a microservices deployment with a tangled domain model. The latter is much worse.
Microservices without discipline are slower and more error-prone than a monolith without discipline. The “without discipline” part is doing a lot of work in that sentence. Most teams do not have the discipline microservices require, because microservices require an organizational maturity (team independence, ownership, on-call rotation, observability, deployment infrastructure) that most teams have not built yet.
The right time to extract a service is when you have measured pain that a service would solve. Shopify has extracted services. They did it for specific reasons: a checkout component that needed to scale independently, a real-time inventory store that needed different consistency guarantees, a payments path that needed isolation. They did not extract for fashion.

The Shopify story is the antidote to the “we are too big for a monolith” argument. They are larger than you, they were larger than you when they made the call, and the call was: keep the monolith, invest in modularity.

Case 3: Basecamp and the majestic monolith

If Stripe is the engineering case and Shopify is the architectural case, Basecamp is the philosophical case. David Heinemeier Hansson, the creator of Ruby on Rails and a co-founder of 37signals, wrote an essay in 2016 called “The Majestic Monolith” (https://m.signalvnoise.com/the-majestic-monolith/). It is one of the most-cited and most-misunderstood pieces in the architecture canon. Read it in full if you have not.

The argument is not that monoliths are universally correct. The argument is that for a small product team (the company, then called Basecamp, had about a dozen programmers when the essay was written), the cognitive overhead of a distributed system erases any throughput gain you would get from scaling out. A team of twelve engineers running a Rails monolith can ship product faster, debug faster, onboard faster, and operate the system at lower cost than the same twelve engineers running fifteen microservices. The throughput math is not about machines; it is about humans.

The post is sometimes read as anti-microservices. It is not. It is anti-premature-microservices. The distinction matters. There is a load level, a team size, an organizational complexity, at which microservices pay off. The mistake is assuming you are at that level when you are not.

The lessons:

Architecture is constrained by team size as much as by load. A two-pizza team running ten services is spending more time on operational coordination than on product. A two-pizza team running one well-organized service is shipping.
The most expensive architecture is the one that is bigger than your problem. Stripe’s lesson is that simple goes far. Shopify’s lesson is that modularity inside the monolith captures most of the benefit of services. Basecamp’s lesson is that for a small team, the simple architecture is also the fast one.
Refactor toward services when the pain forces you to, not when the conference talks tell you to.

The cross-cutting takeaway from all three case studies is the same. Defer architectural complexity. Pay the operational cost of distributed systems only when the load demands it. The rest of the time, invest in the simple architecture: better queries, better indexes, better tests, better deploys, better observability. That work compounds. Architectural complexity for its own sake does not.

Three timelines on one page

```mermaid
flowchart LR
    subgraph stripe ["Stripe (Postgres-first)"]
        direction LR
        s1["<b>2010</b><br/>Founded on Postgres"] --> s2["<b>2014</b><br/>Realtime fraud<br/>and ledger on PG"] --> s3["<b>2017</b><br/>Sharding layer<br/>on top of PG"] --> s4["<b>2020</b><br/>Online migration<br/>patterns shared"] --> s5["<b>2024</b><br/>Bulk of prod<br/>still on PG"]
    end

    subgraph shopify ["Shopify (Rails monolith)"]
        direction LR
        h1["<b>2004</b><br/>Founded as<br/>a Rails app"] --> h2["<b>2014</b><br/>Multi-billion GMV<br/>on the monolith"] --> h3["<b>2016</b><br/>Modular monolith<br/>strategy"] --> h4["<b>2019</b><br/>Deconstructing<br/>the Monolith"] --> h5["<b>2024</b><br/>Selective<br/>extractions"]
    end

    subgraph basecamp ["Basecamp and 37signals"]
        direction LR
        b1["<b>2004</b><br/>Basecamp<br/>on Rails"] --> b2["<b>2016</b><br/>Majestic Monolith<br/>essay"] --> b3["<b>2020</b><br/>HEY launched,<br/>also a monolith"] --> b4["<b>2024</b><br/>Small set of<br/>monoliths,<br/>dozen engineers"]
    end

    classDef yr fill:#1f2933,stroke:#0d9488,color:#e8edf1
    classDef boundary fill:transparent,stroke:#0d9488,stroke-dasharray: 5 5
    class s1,s2,s3,s4,s5,h1,h2,h3,h4,h5,b1,b2,b3,b4 yr
    class stripe,shopify,basecamp boundary
```

The visual point of the timeline is that the simple architecture was not a brief phase before the inevitable rewrite. In all three cases it lasted, and is still lasting, well past the scale at which most teams assume they need to move on. The rewrite often does not come. When it does, it is targeted, slow, and justified by specific pain.

What the case studies do not say

It is worth being explicit about what these stories do not prove, because the discourse around them tends to overreach in both directions.

They do not prove that monoliths are always correct. They are not. There are workloads (real-time bidding, geographically distributed read-heavy serving, machine-learning inference at the edge) where a single monolith genuinely cannot do the job, and where a service-oriented or distributed architecture is the right starting point. Those workloads are rarer than the discourse suggests, but they exist.

They do not prove that Postgres is always correct. A workload that is fundamentally analytical (terabytes of telemetry, ad-hoc OLAP queries, fan-out aggregations) belongs on a columnar store, not on Postgres. A workload that is fundamentally key-value with high write throughput and low consistency requirements belongs on a key-value store. The Stripe lesson is that transactional workloads, even at very large scale, can stay on Postgres. It is not that all workloads should.

They do not prove that microservices are bad. They prove that microservices are expensive, and that the expense is rarely justified before specific scale and organizational thresholds. After those thresholds, microservices can be the right choice. Shopify uses some. Stripe uses some. The point is that “some” is much smaller than “all,” and it is selected by need rather than by default.

What the case studies do prove is a single, repeatable pattern: the teams that resisted complexity until the symptoms forced their hand built systems that lasted, evolved, and stayed cheap to run. The teams that adopted complexity early, for fashion or for fear, built systems that were expensive to run and slow to change. The first group includes Stripe, Shopify, and Basecamp. The second group is unnamed because the post-mortems are private. They exist. You may have worked at one.

Where the course goes from here

You have made it to the end of Module 1. The takeaway from these eight lessons is, in one sentence: start simple, recognize the symptoms when they appear, and move only when the symptoms force the move. The architecture in lessons 1 to 6 is enough for most teams for most of their useful life.

But “most” is not “all.” Some teams genuinely outgrow the single server, and the question becomes how to evolve the architecture without making it worse. That is what the rest of the course is about, and the next module starts with the fundamentals of distributed systems: what changes the moment you stop having one machine, why distributed systems are harder than they look, and what mental models you need before you write a single line of code that talks to another machine over a network.

Module 2 starts in the next lesson with one of the foundational ideas of the field: the eight fallacies of distributed computing. The list is thirty years old and still painfully relevant. We will go through it one fallacy at a time, with examples of systems that failed because their builders assumed the fallacy was true.

Stripe, Shopify, and Basecamp deferred that complexity for as long as they could. We will spend the next several lessons learning what they were deferring.

Citations and further reading

Stripe Engineering, “Online Migrations at Scale”, https://stripe.com/blog/online-migrations (retrieved 2026-05-01).
Shopify Engineering, “Deconstructing the Monolith: Designing Software That Maximizes Developer Productivity”, https://shopify.engineering/deconstructing-monolith-designing-software-maximizes-developer-productivity (retrieved 2026-05-01).
David Heinemeier Hansson, “The Majestic Monolith”, https://m.signalvnoise.com/the-majestic-monolith/ (retrieved 2026-05-01).
Stripe sharding and operations talks, available via the Stripe Engineering YouTube channel and conference archives from 2018 through 2020.