What software architecture actually is

Welcome to lesson one of the Data and System Architecture course. Eighty lessons, starting from a single VM running a Python script and ending with a global multi-region system that doesn’t fall over when one of its data centres catches fire. We will go through the messy middle in order: how a system grows from one box to two, from two to a small fleet, from a small fleet to a region, and from a region to the planet. We will cover the trade-offs along the way, the things you cannot un-decide once you’ve decided them, and the surprising number of failures that come from people forgetting that the network is not, in fact, reliable.

This course leans heavily on three pieces of writing that I think every serious systems person should read: Martin Kleppmann’s “Designing Data-Intensive Applications”, Sam Newman’s “Building Microservices”, and the Google SRE workbook. None of those are reproduced here. I borrow vocabulary from them, occasional examples, and the general taste of how to think about distributed systems, but the lessons are written from scratch with the running examples and the angle I find useful. If you finish this course and want to go deeper, those three are where you go.

Before we touch a single architecture diagram, we have to answer a question that sounds suspiciously like undergraduate philosophy and turns out to actually matter when you are sitting in a meeting: what is software architecture, exactly? It’s worth getting this right because the wrong definition will lead you to argue about the wrong things for the rest of your career.

The classic definitions, and why they don’t help

The most-cited textbook definition comes from Bass, Clements, and Kazman in “Software Architecture in Practice”:

The software architecture of a system is the set of structures needed to reason about the system, which comprise software elements, relations among them, and properties of both.

That is a perfectly accurate sentence. It is also, to be blunt, useless when you’re standing in front of a whiteboard at 4pm on a Tuesday trying to decide whether to use Postgres or DynamoDB. It tells you that architecture is “structures and elements and relations and properties,” which is a bit like saying that a meal is “ingredients and combinations and flavours and presentations.” Technically true. Doesn’t help you cook dinner.

Martin Fowler, who has thought about this more than most, gave up trying to find a clean definition and instead quoted Ralph Johnson:

Architecture is about the important stuff. Whatever that is.

This is closer to the truth and also obviously circular. What’s important? Important to whom? When? Fowler’s point in citing it is that architects on a real project end up agreeing, mostly intuitively, on what the “important stuff” is, and that’s what they spend their time on. The definition is a description of what architects actually do, not a procedure for figuring out what to do.

I have read a great many variations of this definition, and they all share a single problem. They describe architecture as a thing (a set of structures, a set of decisions, a set of important things) without telling you how to recognise whether a given choice you’re about to make is architectural or not. And recognising that is the entire skill. If you can spot the architectural decision in a pile of choices, you know which ones to slow down on, which ones to debate, which ones to write down, and which ones to just ship.

So here is the working definition I’m going to use for the next eighty lessons.

The working definition: architecture is what’s expensive to change

Architecture is the set of decisions that are expensive to change later.

That’s it. If a decision is hard to change once it’s in production, it’s architectural. If it’s easy to change, it isn’t.

Ralph Johnson’s “important stuff” is just a less precise way of saying the same thing. A decision is “important” in the architectural sense because reversing it would cost you weeks, months, or years. The decisions that don’t have that property aren’t architectural, no matter how weighty they feel in the moment. The choice of variable naming convention feels weighty in a code review and it isn’t architectural. The choice of programming language for a service feels routine in week one and it absolutely is.

This definition has a useful corollary: architectural decisions are not the same as good engineering decisions, and they are not always the same as the decisions you spend the most time arguing about. Some architectural decisions get made in five seconds because the answer is obvious (“we’ll use Postgres because everyone here knows Postgres”). Some non-architectural decisions consume a week of debate (“what should the JSON field naming convention be”). The amount of arguing is not a reliable signal. The cost of reversal is.

Architectural choices versus design choices

Let’s make this concrete. Here is a list of choices a team might make on a typical project. Some of them are architectural in our sense; some are not. Walk through them with the “expensive to change?” test in your head.

Choice of database. Architectural. Switching from Postgres to DynamoDB three years in is a multi-month project. You’ll rewrite the data access layer, redesign your schemas, redo your indexes, change your transactional patterns, retrain everyone, and re-run your performance tests. People do this and they remember it forever.
Choice of programming language for a service. Architectural. Rewriting a 50,000-line Java service in Go is not something you do over a weekend. Even rewriting it gradually with the strangler fig pattern is a year of someone’s life.
Synchronous HTTP versus an event bus for service-to-service communication. Architectural. The whole shape of how services interact, fail, retry, and observe each other depends on this. Switching one for the other touches every endpoint in the system.
Single-region versus multi-region deployment. Extremely architectural. The latency assumptions, the consistency model, the way you handle failover, and the way the cloud provider bills you all change.
Choice of REST versus GraphQL for a public API. Mostly architectural, because once external clients depend on it, deprecating it takes years of polite emails.
Method name in an internal class. Not architectural. Rename it, run the tests, ship it.
Naming convention for environment variables. Not architectural. Annoying to change but cheap.
Whether to use a logging framework or print statements during early prototyping. Not architectural. You’ll switch to a framework the day you start running in staging.
Choice of auth provider (Auth0 versus Cognito versus Keycloak versus rolling your own). Architectural. Migrating users between auth systems is a real project, partly because password hashes often cannot be exported from one provider or imported into another, which leaves you forcing a reset or migrating each user at their next login.
The exact retry-and-backoff policy for a single HTTP call. Almost never architectural. Tweak it on Tuesday, deploy it Wednesday, watch it work better.
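
To see why a retry policy sits on the cheap side of that list, here is a minimal sketch of one for a single call. The names `RetryPolicy`, `backoff_delays`, and `call_with_retry` are hypothetical, and the numbers are placeholders rather than recommendations. The point is that the whole policy is a handful of values in one place, which is what makes it easy to tweak and redeploy.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry settings for one outbound call."""
    max_attempts: int = 4
    base_delay_s: float = 0.2
    max_delay_s: float = 5.0
    jitter: bool = True


def backoff_delays(policy: RetryPolicy) -> Iterator[float]:
    """Yield the wait before each retry: exponential, capped, optionally jittered."""
    for attempt in range(policy.max_attempts - 1):
        delay = min(policy.max_delay_s, policy.base_delay_s * (2 ** attempt))
        # Full jitter: a random wait in [0, delay], so that many clients
        # retrying at once do not all arrive at the same instant.
        yield random.uniform(0, delay) if policy.jitter else delay


def call_with_retry(fn: Callable[[], T], policy: RetryPolicy,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying until it succeeds or the policy is exhausted.

    Simplification: this retries on any exception. Real code would retry
    only transient errors, and only calls that are safe to repeat.
    """
    delays = backoff_delays(policy)
    while True:
        try:
            return fn()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # out of attempts: surface the last error
            sleep(delay)
```

Changing the behaviour means editing the defaults in `RetryPolicy` or passing a different instance at the call site. Nothing outside this one call needs to know.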

The pattern: anything that touches schema, contract, protocol, language, deployment topology, identity, or data ownership tends to be architectural. Anything that touches names, formats, internal helpers, or local algorithms usually isn’t. There are exceptions. The list is a starting point, not a checklist.
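
The pattern above can be written down as a rough first-pass check. This is a sketch of the rule of thumb, not a classifier to rely on: the tag names, the `first_pass` helper, and the tagging of each decision are all hypothetical, and the exceptions still apply.

```python
from typing import AbstractSet

# Concerns that, per the rule of thumb, tend to make a decision architectural.
ARCHITECTURAL = frozenset({
    "schema", "contract", "protocol", "language",
    "deployment topology", "identity", "data ownership",
})

# Concerns that usually do not.
LOCAL = frozenset({"names", "formats", "internal helpers", "local algorithms"})


def first_pass(touches: AbstractSet[str]) -> str:
    """Give a first guess from the concerns a decision touches."""
    if touches & ARCHITECTURAL:
        # One expensive-to-change concern is enough to slow down.
        return "likely architectural"
    if touches & LOCAL:
        return "likely not architectural"
    return "unclear, look closer"


# Hypothetical tagging of a few choices from the list above.
decisions = {
    "Postgres versus DynamoDB": {"schema"},
    "REST versus GraphQL for a public API": {"contract", "protocol"},
    "Auth0 versus Cognito": {"identity"},
    "Method name in an internal class": {"names"},
    "Retry-and-backoff policy for one call": {"local algorithms"},
}

for name, touches in decisions.items():
    print(f"{name}: {first_pass(touches)}")
```

The architectural check runs first on purpose: a decision that touches both a name and a public contract is still expensive to reverse.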

The decision irreversibility ladder

The “expensive versus cheap to change” cut is a useful first sieve, but in practice cost-of-reversal is a continuum, not a binary. I find it helpful to think of it as a ladder with at least three rungs:

Reversible in a day. A new feature flag. A logging level change. A SQL index. A renamed variable. A bumped library version, assuming nothing breaks. You can ship the change in the morning and roll it back in the afternoon if it goes badly.

Reversible in a quarter. A new internal service. A swap of one cache technology for another. A move from one cloud region to another within the same provider. A switch from one ORM to a different one, in a codebase small enough that “rewrite the data layer” is feasible inside a quarter. These changes need a project plan and at least one engineer working on them for weeks. They can be rolled back; they just don’t get rolled back casually.

Reversible in a year, or never. Choice of database family. Choice of cloud provider. Choice of authentication system. Choice of public API shape. Choice of geographic deployment topology. Choice of data ownership boundaries between teams. Once these are in production with a customer base on top of them, getting out is a major company effort. Some are not even technically reversible without losing data or breaking contracts; you just live with them and route around them.

The skill of architecture is recognising which rung a given decision is on, and matching the seriousness of your process to the rung. A day-rung decision can be made by one engineer in fifteen minutes. A quarter-rung decision deserves an architecture decision record (ADR) and a couple of hours of design. A year-rung decision deserves real research, real prototyping, and a small panel of people who have made similar choices before agreeing it’s the right call.
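
One way to make the ladder concrete is to map a rough estimate of reversal cost onto a rung and a matching level of process. The cut-offs below (five working days, about thirteen working weeks) and the names `Rung`, `PROCESS`, and `rung_for` are assumptions chosen for illustration; the lesson itself only distinguishes a day, a quarter, and a year or never.

```python
from enum import Enum
from typing import Optional


class Rung(Enum):
    DAY = "reversible in a day"
    QUARTER = "reversible in a quarter"
    YEAR = "reversible in a year, or never"


# Process matched to each rung, paraphrasing the paragraph above.
PROCESS = {
    Rung.DAY: "one engineer, about fifteen minutes",
    Rung.QUARTER: "write an ADR and spend a couple of hours on design",
    Rung.YEAR: "research, prototype, and review with people who have made this choice before",
}

# Assumed cut-offs in working days; tune these to your own team.
DAY_RUNG_MAX_DAYS = 5        # "hours to days"
QUARTER_RUNG_MAX_DAYS = 65   # roughly thirteen working weeks


def rung_for(reversal_days: Optional[float]) -> Rung:
    """Place a decision on the ladder from an estimated cost to undo it.

    Pass None when the decision cannot be undone without losing data
    or breaking contracts.
    """
    if reversal_days is None:
        return Rung.YEAR
    if reversal_days < 0:
        raise ValueError("reversal_days must be non-negative")
    if reversal_days <= DAY_RUNG_MAX_DAYS:
        return Rung.DAY
    if reversal_days <= QUARTER_RUNG_MAX_DAYS:
        return Rung.QUARTER
    return Rung.YEAR


# Invented estimates, for illustration only.
for name, days in [("feature flag", 0.5), ("cache swap", 30), ("cloud provider", None)]:
    rung = rung_for(days)
    print(f"{name}: {rung.value} -> {PROCESS[rung]}")
```

The estimate only has to be good enough to pick a rung, not accurate to the day.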

The classic mistake of junior teams is treating every choice as a year-rung decision and never shipping anything. The classic mistake of senior teams under deadline pressure is treating year-rung decisions as quarter-rung ones because they don’t feel like a big deal in the moment. Both are bad. The job is to know which is which.

flowchart LR
    A[Easy to change<br/>hours to days] --> B[Medium<br/>weeks to a quarter]
    B --> C[Hard<br/>quarter-plus to never]
    A1[variable name<br/>log level<br/>retry policy<br/>feature flag] -.examples.-> A
    B1[new internal service<br/>cache swap<br/>region within cloud<br/>ORM change] -.examples.-> B
    C1[database family<br/>cloud provider<br/>auth provider<br/>public API shape<br/>multi-region topology] -.examples.-> C

The middle rung is interesting because that’s where most of the actual work of “architecture” lives. Year-rung decisions are rare; you make a handful of them per system, ever. Day-rung decisions are constant and don’t need ceremony. The quarter-rung decisions are the ones where having a deliberate process pays off, because they happen often enough that bad ones accumulate, and they’re expensive enough that you can’t afford to make them sloppily.
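
For quarter-rung decisions, that deliberate process usually starts with an ADR. Below is a minimal sketch of one, using the Status / Context / Decision / Consequences headings popularised by Michael Nygard’s ADR format. The `ADR` class, its `render` method, and the example content are hypothetical; ADRs get proper treatment in the final module.

```python
from dataclasses import dataclass


@dataclass
class ADR:
    """A minimal architecture decision record (hypothetical structure)."""
    number: int
    title: str
    status: str          # e.g. "proposed", "accepted", "superseded"
    context: str         # the forces at play and why a decision is needed
    decision: str        # what was chosen
    consequences: str    # what gets easier or harder, including cost to reverse

    def render(self) -> str:
        """Render the record as plain text suitable for committing next to the code."""
        sections = [
            ("Status", self.status),
            ("Context", self.context),
            ("Decision", self.decision),
            ("Consequences", self.consequences),
        ]
        lines = [f"ADR {self.number:04d}: {self.title}", ""]
        for heading, body in sections:
            lines += [heading, body, ""]
        # Collapse trailing blank lines to a single newline.
        return "\n".join(lines).rstrip() + "\n"


# Invented example, for illustration only.
adr = ADR(
    number=7,
    title="Swap the session cache from Memcached to Redis",
    status="proposed",
    context="We need session data to survive a cache restart.",
    decision="Move session storage to Redis behind the existing cache interface.",
    consequences="Quarter-rung: reversing this means another migration of similar size.",
)
print(adr.render())
```

Writing the cost of reversal into the consequences section is a design choice of this sketch. It keeps the rung visible to whoever reads the record later.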

What this course covers, and what it doesn’t

The eighty lessons are organised into ten modules. At a glance:

Foundations (lessons 1 to 8). Definitions, requirements, the C4 model, the basic vocabulary. You’re in lesson 1 right now.
One machine (lessons 9 to 16). What you can do on a single VM. Process model, threading, async, file I/O, the local OS as a system.
Storage and databases (lessons 17 to 24). Relational versus document versus key-value versus columnar. Indexes, transactions, isolation levels.
Two machines and the network (lessons 25 to 32). Latency, bandwidth, the fallacies of distributed computing, RPC versus REST versus gRPC, idempotency.
Caching, queues, and async (lessons 33 to 40). Redis, Kafka, message brokers, eventual consistency, the outbox pattern.
Service decomposition (lessons 41 to 48). Monoliths versus modular monoliths versus microservices. When to split, when not to, how to draw the seams.
Reliability and observability (lessons 49 to 56). SLOs, error budgets, logging, metrics, tracing, on-call.
Scaling out (lessons 57 to 64). Sharding, partitioning, consistent hashing, leader election, distributed consensus in just enough depth.
Multi-region and global systems (lessons 65 to 72). Active-passive versus active-active, geo-routing, conflict resolution, the actual cost of going global.
Practice and decision-making (lessons 73 to 80). ADRs, fitness functions, evolutionary architecture, interviews, and how to tell when you’re over-engineering.

Diagram to create: A 10-box flowchart laid out in a 5x2 grid, each box labelled with a module number and topic cluster. Module 1 (top-left, Foundations) and Module 10 (bottom-right, Practice) shaded a different colour from the eight middle modules to mark them as bookends. Arrows running left-to-right and top-to-bottom showing the natural reading order. Below each box, two or three sub-topics in smaller text (for example under “Storage and databases” list “relational, document, columnar”). Title at the top: “Data and System Architecture: 80 lessons in 10 modules.”

What this course is not: a tutorial on any specific cloud provider, a deep dive into one programming language, or a manifesto for one architectural style. I’ll mention AWS, Azure, GCP, and the major open-source projects throughout, but the goal is for you to be able to reason about systems regardless of which buttons you happen to be clicking this year. The cloud providers will rename their products three more times between when I write this and when you read it. The principles will not.

What you should walk away from this lesson with

One sentence: architecture is the set of decisions that are expensive to change later.

Two corollaries: the volume of arguing is not a reliable signal of how architectural a decision is, and the cost of reversal is a ladder, not a binary.

One habit: when you’re about to make a technical decision, take five seconds to ask “if I’m wrong about this, what does it cost to undo?” If the answer is “an afternoon,” go fast. If the answer is “a quarter,” slow down and write it down. If the answer is “a year,” call in reinforcements.

The next lesson tackles the other half of the architectural input: requirements. Specifically, the difference between functional requirements (what the system does) and non-functional requirements (how well it does it), and why the second category is what actually drives the architecture. See you there.

References

Bass, Clements, Kazman. Software Architecture in Practice, 4th edition (2021).
Martin Fowler. “Who Needs an Architect?”, IEEE Software (2003), the source of the Ralph Johnson quotation; Patterns of Enterprise Application Architecture (2002); and his ongoing essays at martinfowler.com.
Simon Brown. The C4 model, https://c4model.com (retrieved 2026-05-01). Covered in lesson 3.
Martin Kleppmann. Designing Data-Intensive Applications (2017).
Sam Newman. Building Microservices, 2nd edition (2021).
Beyer, Jones, Petoff, Murphy (eds.). Site Reliability Engineering (Google, 2016).
Beyer, Murphy, Rensin, Kawahara, Thorne (eds.). The Site Reliability Workbook (Google, 2018).