Why distributed systems are hard: the 8 fallacies

Welcome to Module 2. Module 1 was about the architecture you build before you have to think about distributed systems: one machine, one process, one database, a fast loop, and the discipline to keep it that way for as long as the load lets you. Stripe, Shopify and Basecamp showed up in lesson 8 as evidence that “as long as the load lets you” is usually longer than people think.

But sooner or later, on most ambitious systems, the single machine stops being enough. You add a second one. And the moment you do, you have a distributed system, with all of the new failure modes that come with it. The whole rest of this course is, in some sense, a reckoning with that single jump from one box to two.

Before we get to specific patterns, we have to confront the lies. Distributed systems are hard not because the algorithms are exotic but because beginners (and a worrying number of people who should know better) reason about a network as if it were a function call. It is not. The differences are large, expensive, and the source of approximately every production incident in the history of distributed software.

The most useful enumeration of those lies is a list of eight statements that originated at Sun Microsystems: seven are attributed to Peter Deutsch in 1994, and James Gosling added the eighth in 1997. The list is called the eight fallacies of distributed computing. Every one of them sounds true the first time you hear it, is in fact false, and will cause you to ship code that breaks in surprising ways the moment your environment gets unfriendly.

The list, in canonical order:

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn’t change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

We will go through each one and look at the failure mode it produces. The thesis of this lesson is simple: most distributed-system bugs trace back to believing at least one of these. Every architecture pattern in the rest of the course exists, in part, to make the resulting failure recoverable rather than catastrophic.

1. The network is reliable

It is not. Networks drop packets, routers reboot, links flap, fibre cuts happen, BGP misconfigurations black-hole entire regions, cloud providers have outages roughly once a quarter, and the load balancer in front of your service hangs at exactly the moment the on-call engineer is out for dinner. Any of these turns a “synchronous call to another service” into “a request that may have happened, may not have happened, or may have happened twice.”

Why is this fallacy tempting? Because in development you call other services on localhost, where the loopback interface really is reliable. The first time your code runs in a real environment with a real network between two real machines, the assumption breaks.

The concrete failure mode: your POST /charge request times out. Did the charge happen? You don’t know. The retry logic kicks in and sends the request again. If the original happened to succeed, you’ve now charged the customer twice. If you didn’t write idempotency keys into your protocol, you have a financial bug that nobody can untangle without manual intervention.

The architectural answer is a small set of patterns we will spend later lessons on: idempotency keys, retries with exponential backoff and jitter, timeouts at every hop, circuit breakers, and the assumption that any single network call may need to be replayed without harm.
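To make the double-charge scenario concrete, here is a minimal sketch of two of those patterns working together: an idempotency key plus retries with exponential backoff and jitter. Everything in it is an assumption made for illustration: `FakeChargeServer` is an in-memory stand-in for a payment service, `TransientNetworkError` stands in for a timeout, and the delay values are arbitrary. It is not how any particular payment API works.

```python
import random
import time
import uuid


class TransientNetworkError(Exception):
    """Stands in for a timeout or dropped connection (hypothetical)."""


class FakeChargeServer:
    """In-memory stand-in for a payment service that deduplicates by key."""

    def __init__(self, lose_first_n_responses=0):
        self.charges = {}  # idempotency key -> amount charged
        self._to_lose = lose_first_n_responses

    def charge(self, idempotency_key, amount_cents):
        # Apply the charge at most once per key; a replay changes nothing.
        self.charges.setdefault(idempotency_key, amount_cents)
        if self._to_lose > 0:
            # The charge happened, but the response is lost on the way back.
            self._to_lose -= 1
            raise TransientNetworkError("response lost")
        return {"status": "charged", "amount_cents": self.charges[idempotency_key]}


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Retry `operation` with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == max_attempts - 1:
                raise
            # Cap the exponential delay, then pick a random point below the cap.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))


def charge_customer(server, amount_cents, sleep=time.sleep):
    # One key per logical charge, generated once and reused on every retry.
    key = str(uuid.uuid4())
    return call_with_retries(lambda: server.charge(key, amount_cents), sleep=sleep)
```

Calling `charge_customer(FakeChargeServer(lose_first_n_responses=2), 1999)` retries twice and still records exactly one charge, because every attempt carries the same key. If the key were generated inside the retry loop instead, each attempt would look like a new charge to the server.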

2. Latency is zero

It is not. A round trip on a fast LAN is on the order of half a millisecond. Across availability zones inside a cloud region, one to two milliseconds. Across regions, twenty to two hundred. Across continents, a hundred to three hundred. Light has a speed limit and so do you.

The fallacy is tempting because in-process function calls are effectively free at this scale. Your IDE shows the same syntax for a local method call and a remote one (especially in stub-generating systems like gRPC), so it is easy to forget that one of them costs nanoseconds and the other costs ten million times more.

The classic failure mode is the chatty API. Your “list orders” endpoint returns a list of order IDs. The frontend then makes one call per ID to get the order details. On a single machine, fine. Across a network with twenty-millisecond round-trip times, fetching a hundred orders one call at a time takes two seconds of pure waiting before the last order has arrived. This is the N+1 problem you may have met in databases, and it gets dramatically worse over the wire.

The architectural answer is to design APIs that batch, paginate, and return enough data per request that the round-trip count stays low. This rule will come up again when we discuss API design, gRPC versus REST, and the difference between request-response and streaming protocols.
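The arithmetic behind the chatty-API example is worth writing down once. The sketch below assumes the calls run one after another and that a batch endpoint accepting many IDs exists; that endpoint is hypothetical, and the calculation ignores server processing time entirely.

```python
def total_latency_ms(round_trips, rtt_ms):
    """Time spent waiting on the network when calls run one after another."""
    return round_trips * rtt_ms


RTT_MS = 20   # round-trip time from the example in this section
ORDERS = 100

# Chatty: one call for the ID list, then one call per order.
chatty_ms = total_latency_ms(1 + ORDERS, RTT_MS)

# Batched: one call for the ID list, one call for all the details
# (assumes a hypothetical batch endpoint that accepts many IDs).
batched_ms = total_latency_ms(2, RTT_MS)

print(chatty_ms, batched_ms)
```

Under those assumptions the chatty version waits 2020 milliseconds and the batched version waits 40. Issuing the hundred calls in parallel would also cut the wall-clock time, but it multiplies the load on the server and the connection count, which is why batching is usually the first thing to reach for.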

3. Bandwidth is infinite

It is not. A 10 gigabit-per-second link sounds like a lot until you try to copy a hundred-gigabyte snapshot across it during the working day, at which point everyone else’s requests slow down. Egress fees from your cloud provider sound trivial until you ship a feature that copies large blobs cross-region and discover at the end of the month that you have spent more on bandwidth than on compute.

This fallacy is tempting because in development the data is small, the link is local, and bandwidth feels free. In production the data is large, the link is shared, and the bill is monthly.

The failure modes are slow page loads from over-fetching, congested links during backups, and surprise egress charges. We will spend a later lesson on data locality (keep the compute close to the data, not the other way around) and on the patterns for moving large data sets without saturating the production path.
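For the snapshot example, a back-of-the-envelope calculation shows why the copy hurts. This is a best-case sketch: it ignores protocol overhead, and the `usable_fraction` of 0.25 is an assumed figure for how much of the link you can take without starving production traffic.

```python
def transfer_seconds(size_gigabytes, link_gigabits_per_s, usable_fraction=1.0):
    """Best-case time to move a payload, ignoring protocol overhead."""
    size_gigabits = size_gigabytes * 8  # 8 bits per byte
    return size_gigabits / (link_gigabits_per_s * usable_fraction)


# The whole 10 Gbit/s link, with nothing else on it.
whole_link = transfer_seconds(100, 10)

# Only a quarter of the link, leaving the rest for production traffic.
shared = transfer_seconds(100, 10, usable_fraction=0.25)

print(whole_link, shared)
```

With the whole link to itself the copy takes 80 seconds, during which nothing else gets through at full speed. Throttled to a quarter of the link it takes 320 seconds but leaves room for everyone else. Neither number is free, which is the point.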

4. The network is secure

It is not. Unencrypted traffic between services in the same data centre was the default for years, on the theory that the data centre’s perimeter was the security boundary. Then someone got into the perimeter (every year, multiple times, at companies that should know better) and discovered that the soft chewy centre had no authentication on internal RPCs.

The fallacy is tempting because writing TLS configuration is annoying and the inside of your VPC feels like a private place. It is not private. It contains your code, your dependencies, your colleagues’ code, sometimes a third-party agent that has been compromised upstream, and traffic from a thousand machines you do not personally know.

The current best practice is zero-trust networking: every service authenticates every other service with mutual TLS, every call carries a short-lived token, and “inside the perimeter” provides exactly zero implicit trust. This is what tools like SPIFFE, Istio, and Linkerd exist to make tractable. We will return to this in the security and operability lessons.
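As a rough illustration of the mutual TLS half of that sentence, here is what requiring client certificates looks like with Python’s standard `ssl` module. The function names and the three file paths are placeholders chosen for this sketch. In practice a service mesh or a SPIFFE implementation would issue and rotate the certificates for you, and the short-lived tokens are a separate layer that this sketch does not cover.

```python
import ssl


def require_client_certificates(context):
    """Harden a server-side context so unauthenticated peers are rejected."""
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED makes the handshake fail unless the client presents a
    # certificate that chains to a CA this server has been told to trust.
    context.verify_mode = ssl.CERT_REQUIRED
    return context


def build_server_context(certfile, keyfile, client_ca_file):
    """Server context for mutual TLS; the caller supplies all three paths."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.load_cert_chain(certfile=certfile, keyfile=keyfile)  # our identity
    context.load_verify_locations(cafile=client_ca_file)         # whom we trust
    return require_client_certificates(context)
```

The important line is `verify_mode = ssl.CERT_REQUIRED`. A server-side context does not ask clients for a certificate by default, so encryption alone tells you nothing about who is calling.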

5. Topology doesn’t change

It does. Services come up, go down, get redeployed, get rescheduled by Kubernetes onto a different node, change IP, change port, lose their DNS record for thirty seconds during a rollout, get scaled from three to ten instances and back, get blue-green deployed with two parallel sets of pods. The topology you sketched on the whiteboard is a snapshot of one moment.

The fallacy is tempting because in development the address of the database is in a config file and never changes. In production the address is “wherever the orchestrator put it this minute” and the config file is a stale lie.

The failure modes are hardcoded IP addresses, DNS TTLs that lie, connections to instances that no longer exist, and the entire category of bug where a client caches a service endpoint and keeps trying to reach it long after it has moved. The architectural answer is service discovery (Consul, etcd, Kubernetes services), connection pooling that refreshes, and the discipline of treating any address as ephemeral.
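Treating an address as ephemeral can be as simple as refusing to cache it for long. The sketch below is one possible shape, not a replacement for a real service-discovery client: `EndpointCache` and its five-second `max_age_s` are inventions for this example, and the resolver and clock are injectable only so the behaviour can be tested without a network.

```python
import socket
import time


def resolve_with_dns(host, port):
    """Look up the current addresses for host:port using the system resolver."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr).
    return [info[4] for info in infos]


class EndpointCache:
    """Caches resolved addresses briefly and re-resolves once they are stale."""

    def __init__(self, host, port, max_age_s=5.0,
                 resolve=resolve_with_dns, clock=time.monotonic):
        self.host, self.port = host, port
        self.max_age_s = max_age_s
        self._resolve, self._clock = resolve, clock
        self._addresses, self._fetched_at = [], None

    def addresses(self):
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self.max_age_s
        if stale:
            self._addresses = self._resolve(self.host, self.port)
            self._fetched_at = now
        return self._addresses

    def invalidate(self):
        """Call after a connection failure so the next lookup is fresh."""
        self._fetched_at = None
```

The two habits worth copying are the bounded age and the `invalidate()` call on failure. A client that does neither is the one that keeps dialling an instance that was rescheduled an hour ago.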

6. There is one administrator

There is not. Your system runs on infrastructure managed by your platform team, your cloud provider, your DNS provider, your CDN, your TLS certificate authority, your observability vendor, and probably a third-party API or three. Each of those is a separate organisation with separate priorities, separate maintenance windows, and separate ideas about what “graceful degradation” means.

The fallacy is tempting in a startup of five people where the one engineer who knows everything is in the room. It stops being tempting the first time AWS has an outage in us-east-1 and you discover that your “internal” service depends on a us-east-1-only feature of a third-party SaaS you forgot you were using.

The architectural consequence is dependency mapping, vendor risk analysis, the principle of designing for graceful degradation when an upstream is down, and the understanding that when a request needs every component to be up, your system’s reliability is the product of every component’s reliability, not just that of the parts you wrote.
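The “product of every component’s reliability” point is easy to underestimate, so here is the arithmetic as a small sketch. It assumes the dependencies fail independently and that a request needs all of them, which is a simplification. The 99.9% figure is an illustrative number, not a quote from any vendor’s SLA.

```python
import math

MINUTES_PER_30_DAYS = 30 * 24 * 60


def serial_availability(availabilities):
    """Availability of a request path that needs every dependency to be up."""
    return math.prod(availabilities)


def expected_downtime_minutes(availability, period_minutes=MINUTES_PER_30_DAYS):
    """Expected unavailable minutes over the period at a given availability."""
    return (1 - availability) * period_minutes


one_dependency = 0.999
five_dependencies = serial_availability([0.999] * 5)

print(f"{one_dependency:.4%} -> {expected_downtime_minutes(one_dependency):.0f} min/month")
print(f"{five_dependencies:.4%} -> {expected_downtime_minutes(five_dependencies):.0f} min/month")
```

Five dependencies at 99.9% each give a path that is up about 99.5% of the time: roughly 216 minutes of expected downtime in a thirty-day month instead of 43. None of the five vendors broke their promise, and you are still down five times as long.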

7. Transport cost is zero

It is not. Cross-AZ traffic in AWS costs about one cent per gigabyte each way as of 2026. Cross-region transfer generally costs more, at a rate that depends on the region pair. NAT Gateway charges add another fee per gigabyte processed. Egress to the internet is the most expensive of all. A chatty service architecture that crosses an AZ boundary on every request can run up a bill that dwarfs the compute cost.

This fallacy is tempting because the network does not appear in your hourly compute bill. It appears in a separate line item that nobody looks at until the CFO asks about it.

The failure mode is the surprise bill. We will revisit this when we cover multi-AZ and multi-region patterns and the trade-off between availability (more zones) and cost (each zone-crossing has a price).
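A quick estimate shows how the surprise bill builds up. The sketch uses the one-cent-per-gigabyte figure quoted in this section and assumes that “each way” means both the sending and the receiving side are billed. The traffic numbers are invented, and gigabytes are counted in decimal units. Check the current pricing page before relying on any of it.

```python
CROSS_AZ_USD_PER_GB = 0.01  # figure quoted in this section; verify before use


def monthly_cross_az_cost_usd(requests_per_second, kilobytes_per_request,
                              days=30, billed_sides=2):
    """Rough monthly cost of traffic that crosses an AZ boundary.

    billed_sides=2 assumes both the sender and the receiver are charged.
    """
    seconds = days * 24 * 60 * 60
    # 1 GB is treated as 1,000,000 KB (decimal units) for this estimate.
    gigabytes = requests_per_second * kilobytes_per_request * seconds / 1_000_000
    return gigabytes * CROSS_AZ_USD_PER_GB * billed_sides


# Hypothetical service: 2,000 requests per second, 50 KB per request
# (request and response combined), every one of them crossing zones.
estimate = monthly_cross_az_cost_usd(2000, 50)
print(f"${estimate:,.0f} per month")
```

Under those assumptions the service moves 259,200 GB a month and pays about $5,184 for the privilege, before any NAT Gateway or internet egress charges. Keeping a caller and its callee in the same zone where you can is often the cheapest optimisation available.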

8. The network is homogeneous

It is not. Your traffic crosses Linux machines, Windows machines, ARM machines and x86 machines. It traverses routers from three vendors. It hits load balancers that interpret HTTP slightly differently from each other. The TLS library on the client behaves slightly differently from the TLS library on the server. Some hops support HTTP/3, some only HTTP/1.1. Your gRPC payload is fine until it crosses a proxy that buffers in a way the client did not expect.

The fallacy is tempting because in development everything is the same Docker image. In production everything is not.

The failure mode is the protocol-mismatch bug, which is one of the most demoralising kinds: the system works, mostly, until exactly one combination of client version and proxy version interacts badly, and the result is a sporadic five-percent error rate that nobody can reproduce in staging. The architectural answer is interoperability discipline: stick to standard protocols, version your APIs, test against the actual proxies you will run behind, and assume that “homogeneous” is a development-environment fiction.
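One small piece of that interoperability discipline is checking payloads against an explicit contract instead of trusting that both sides run the same code. The sketch below is a toy version of the idea. The field names, the `api_version` convention, and the supported versions are all hypothetical, and a real contract test would also run against the actual proxies in the path.

```python
SUPPORTED_MAJOR_VERSIONS = {1, 2}                        # hypothetical
REQUIRED_FIELDS = {"order_id": str, "total_cents": int}  # hypothetical contract


def check_order_payload(payload):
    """Return a list of contract violations; an empty list means compatible."""
    problems = []
    version = str(payload.get("api_version", ""))
    major = version.split(".")[0]
    if not major.isdigit() or int(major) not in SUPPORTED_MAJOR_VERSIONS:
        problems.append(f"unsupported api_version: {version!r}")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}")
    # Unknown extra fields are ignored so newer servers can add data safely.
    return problems
```

A payload such as `{"api_version": "2.3", "order_id": "o-1", "total_cents": 500, "note": "x"}` passes, because the extra `note` field is ignored. A payload with `api_version` set to `"3.0"` is reported rather than silently misread. Running checks like this in CI against recorded responses from each client and server version is one way to catch the mismatch before it becomes a sporadic error rate.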

What can go wrong on one network call

Eight fallacies is a lot to keep in your head, so it helps to see them as eight ways a single request can fail.

flowchart TD
    A[Client makes one network call] --> B{What can go wrong?}
    B --> F1[Packet lost or duplicated]
    B --> F2[Round trip slower than expected]
    B --> F3[Payload too big for the link]
    B --> F4[Traffic intercepted or replayed]
    B --> F5[Destination has moved]
    B --> F6[Upstream dependency unavailable]
    B --> F7[Crosses a paid boundary]
    B --> F8[Proxy or library mismatch]
    F1 --> R[Architectural responses]
    F2 --> R
    F3 --> R
    F4 --> R
    F5 --> R
    F6 --> R
    F7 --> R
    F8 --> R
    R --> P1[Idempotency, retries, timeouts]
    R --> P2[Batching, locality, caching]
    R --> P3[mTLS, zero trust, tokens]
    R --> P4[Service discovery, healthchecks]
    R --> P5[Cost-aware topology]
    R --> P6[Versioning, contract tests]

Every architecture pattern in the rest of this course will land somewhere among the architectural responses at the bottom of that diagram. Circuit breakers are a response to fallacy 1. Caching layers are a response to fallacies 2 and 3. Service meshes are a response to fallacies 4, 5 and 8. Multi-region design is a calibrated bet against fallacy 6 (different administrators, different blast radii) and a careful management of fallacy 7 (cost). When you read about a pattern and wonder why anyone bothered, the answer is almost always: because someone believed one of the eight fallacies and got burned.

How to use the list

You do not need to memorise the list as such. You need to internalise the habit of asking, before you write any code that crosses a process boundary: which of these am I about to assume?

A useful drill, especially in design reviews, is to take the proposed call graph and walk through it call by call. For each one: how does it behave when the response is slow? When the response never comes? When the response comes twice? When the destination has moved? When the link is saturated? When a proxy in between has a bug? When the bill comes? If the answer to any of those is “we have not thought about it,” that call is a future incident.

Most of the rest of the course is about answering those questions properly: with timeouts, idempotency, observability, service discovery, traffic shaping, contract testing, and the boring but essential discipline of treating the network as a hostile, lossy, probabilistic medium rather than a reliable function call.

What’s next

The fallacies tell you what is hard. They do not tell you what to do about it. The next two lessons introduce the central trade-off that organises the rest of distributed systems theory.

Lesson 10 is the CAP theorem in practice: in the presence of a network partition (fallacy 1, taken to its extreme), you must choose between consistency and availability for every operation. We will go through what CAP actually says, the most common misreadings, and the real choices that real systems (DNS, bank ledgers, Cassandra) make.

Lesson 11 extends that to PACELC, which adds the everyday case CAP is silent on: even when no partition is happening, you are trading latency for consistency on every write. Once you see the PACELC quadrant, the design space of distributed databases stops looking like a zoo and starts looking like a set of explicit, named choices.

We are now firmly in the territory where the lessons of Module 1 (start simple, defer complexity) meet the reality that some systems do need to scale beyond a single machine. Knowing the fallacies is what lets you architect the second machine without immediately regretting it.

Citations and further reading

Peter Deutsch, “The Eight Fallacies of Distributed Computing” (1994), summarised at https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing (retrieved 2026-05-01). The original Sun Microsystems memo is reproduced in several archived collections; the Wikipedia article links to the canonical PDFs.
Arnon Rotem-Gal-Oz, “Fallacies of Distributed Computing Explained” (2006), the most-cited expansion of the list with concrete examples (retrieved 2026-05-01).
AWS pricing pages for cross-AZ, NAT Gateway, and inter-region traffic, for the numbers used in fallacy 7 (retrieved 2026-05-01).
For the security fallacy: BeyondCorp papers from Google (2014 onward) and the SPIFFE specification (https://spiffe.io/) for the modern shape of zero-trust networking inside data centres.