Split brain: what it is and why it ruins everything

Two lessons of partitioning and sharding behind us, this one is about the failure mode that lives quietly underneath all of it. The cluster has a leader. The leader is replicating to followers. The network is mostly working. Then it is not. A switch fails, a routing table updates badly, a cable is unplugged in the wrong rack, a software bug in the network fabric isolates one zone from the others. The cluster is partitioned: not data partitioned, network partitioned. Some nodes can talk to each other and not to the rest. The leader is on one side, some followers are on the other side, and what happens next is the topic.

The headline name for the worst case is split brain. It is the name for the situation where both halves of the cluster believe they are in charge, both accept writes, and both end up with state that the other half does not have. When the network heals and the two halves try to merge, the database has divergent histories, and there is no automatic way to reconcile them without losing data or violating invariants. This lesson is about how split brain happens, why it is uniquely catastrophic among failure modes, and the few mechanisms that reliably prevent it.

The scenario

Consider a three-node cluster: one leader, L, and two followers, F1 and F2. Clients send writes to the leader; the leader replicates to the followers; everybody is happy.

Now a network partition cuts the cluster: L is isolated on one side; F1 and F2 are together on the other. From F1 and F2’s point of view, the leader has gone silent. Heartbeats are missing. They wait their election timeout, decide the leader is dead, and elect F1 as the new leader. From their side of the partition, this is exactly what they were supposed to do.

From L’s point of view, the followers have gone silent. L is still receiving heartbeats from itself and from any clients still on its side of the network. Crucially, L still believes it is the leader. It is still accepting writes from clients. The clients on its side of the partition are getting successful responses to their requests, completely unaware that anything is wrong.

Now the network heals. F1 and F2, with F1 as the new leader, have a stretch of writes that L does not have. L has a stretch of writes that F1 and F2 do not have. The two halves do not just disagree on history: they each have a list of acknowledged writes that the other has never seen, with overlapping keys, with conflicting values, with no way to merge them without picking which writes to lose.

sequenceDiagram
    participant Client1 as Client (left side)
    participant L as Leader L
    participant F1 as Follower F1
    participant F2 as Follower F2
    participant Client2 as Client (right side)
    Note over L,F2: Network partition isolates L
    Client1->>L: write A
    L->>L: accept A (still thinks leader)
    F1->>F2: heartbeat timeout
    F1->>F2: elect F1 as new leader
    Client2->>F1: write B
    F1->>F2: replicate B
    Note over L,F2: Partition heals
    L->>F1: hello, I am leader
    F1->>L: no, I am leader (term incremented)
    Note over L,F1: A and B are conflicting writes; merge is undefined

This is split brain. Both sides did exactly what they were designed to do; the system as a whole produced state that the database cannot reconcile.

Why split brain is uniquely catastrophic

Most distributed-system failures are recoverable. A node crashes: bring it back, replay the log, you are caught up. A replica falls behind: catch up by streaming the log. A leader dies: elect a new one, the old one was not accepting writes anyway. Even Byzantine failures (where a node tells lies) can be tolerated by protocols that are designed for it.

Split brain is different. Both sides accepted writes that the other side did not see. Both sides told their clients those writes had succeeded. The clients have moved on. When the partition heals, the database has two equally-valid histories that disagree on the same keys, and no mechanism to choose between them without breaking promises that have already been made.

The classic illustrative example is a financial ledger. A user has 100 dollars. The user issues a withdrawal of 60 dollars to the left side of the partition; the left leader authorizes it, balance is now 40. The user, frustrated by slow response from somewhere, also issues a withdrawal of 60 dollars to the right side of the partition; the right leader, with a stale copy of the balance still showing 100, authorizes it too, balance is now 40 on its side. When the partition heals, the database has authorized 120 dollars of withdrawals from a 100-dollar account. There is no automatic merge that recovers; the bank has lost 20 dollars.

The same shape applies to any system with invariants that depend on a global view of the data: inventory (“only one of this item exists, it cannot be sold twice”), unique constraints (“this username is taken”), counters (“this resource was acquired by exactly one client”), idempotency keys (“this operation must run at most once”). For all of these, two leaders accepting conflicting writes is the failure mode that violates the property the system was supposed to provide.

Quorum is the only reliable defense

The defense, the only reliable defense, is to require a majority of nodes to agree before any write can be accepted. With three nodes, you need two for a quorum. A 1-2 partition: the side with two nodes can still form a majority and accept writes; the side with one cannot. The lone node sees that it cannot reach a majority of its peers, refuses to accept writes, and either sits idle or returns errors to its clients. There is no possibility of two leaders simultaneously accepting writes, because there is no possibility of two majorities simultaneously existing in a partition of a three-node cluster.

This is exactly why Raft and Paxos require majority votes for leader election and for log commit (lesson 14). The majority requirement is not an arbitrary design choice; it is the only quorum rule that makes split brain mathematically impossible. Any two majorities of N nodes must overlap in at least one node, and that overlapping node’s “I voted for this leader” or “I committed this log entry” enforces a single source of truth across the partition.

The deployment recipe that follows from this is straightforward and easy to forget under cost pressure: run odd-numbered cluster sizes, three at minimum, five for any system you care about, seven if you really want to ride out multiple simultaneous failures. Even-numbered cluster sizes are operationally worse than odd-numbered sizes. A four-node cluster split 2-2 has no majority anywhere, so the entire cluster is unavailable for writes during the partition. Going from three to four nodes does not improve availability; it makes it worse. Five nodes survives two simultaneous failures and partitions cleanly into 3-2 splits with a majority on the larger side.

Two-node clusters are the deployment that gets you bitten the most often. Two nodes have no majority when partitioned: each side has exactly one node, and one is not a majority of two. So either you accept that the cluster is unavailable for writes during any partition (and you might as well have run a single node, which has the same availability profile and half the operational cost), or you allow each node to act on its own when partitioned, and now you have just designed in a split brain. Do not run two-node clusters for anything that matters. Either run one node and accept the single-point-of-failure honestly, or run three nodes and get the actual availability benefit.

Fencing: the second line of defense

Quorum prevents two simultaneous leaders. It does not prevent a leader from briefly believing it is still the leader after it has lost quorum. The scenario: a leader gets isolated from its followers, the followers elect a new leader, but the old leader has not yet noticed it is partitioned and continues to send writes to downstream services. The downstream services do not know about the consensus protocol; they see writes from a node claiming to be the leader and accept them.

The fix is fencing tokens. The consensus system attaches a monotonically increasing token (often called a term, or an epoch, or a fencing number) to every leadership grant. Every write the leader sends to a downstream service includes this token. Downstream services remember the highest token they have ever seen, and reject any write that arrives with a lower token. When the new leader is elected, its token is higher than the old leader’s. When the old leader, still operating on its stale belief, sends a write to a downstream service with its old token, the service rejects it: “your token is 7, I have already seen token 8 from somebody else.” The old leader is fenced out.

Fencing is the right answer for the gap between “lost quorum” and “noticed I lost quorum.” It also covers the slower-leader case: a leader pauses for garbage collection or a swap-out for fifteen seconds, the cluster decides it is dead and elects a new one, and when the old leader resumes it tries to write as if nothing happened. Without fencing, the downstream services accept its writes; with fencing, they reject them.

Implementing fencing requires the downstream services to be aware of the token, which means coordinating fencing across application services and external dependencies. Some systems do this well; many do not. When you read about a “we lost data because of a split brain” postmortem, the missing layer is usually fencing tokens at the storage or external-service boundary.

Real-world failure patterns

Specific incident reports vary, but the shapes recur across many years and many systems. The patterns worth recognising:

DNS-based failover with too-long TTL during a partition. A primary database fails, the operations team flips a DNS record to point to the replica, and traffic begins moving to the replica. But clients have cached the old DNS record for the TTL duration; they continue talking to the old primary, which then comes back online and starts accepting writes again, while other clients have moved to the new primary. Two databases, both accepting writes, glued together by a slow DNS rollover.

Virtual IP failover with a network-layer split brain. Heartbeat-based HA tools (Pacemaker, Keepalived, the older HA stacks) move a virtual IP from a failed primary to a standby. If the heartbeat link between the two nodes fails but the public network is still working, both nodes can decide they are the active node and both can claim the same VIP. The clients on each side of the network see different “primaries” with the same address.

Asynchronous replication with manual failover and a sloppy rejoin. A primary fails, the team promotes a replica, the original primary comes back, somebody runs a “rejoin to the cluster” command without first checking that the original primary did not have writes that did not replicate before it failed. Those writes are silently dropped, or worse, silently merged in a way that creates duplicates.

Two-node clusters with no consensus. The classic recipe: a primary and a replica with mutual heartbeat. Heartbeat link fails, both nodes promote themselves, both accept writes for the duration of the partition, and the rejoin merges the divergent histories with whatever heuristic the operator picks at 3am.

In each case the underlying error is the same: there was no majority requirement, no fencing token, no consensus protocol authoritatively deciding “this node is the leader and only this node.” Split brain is what happens when leadership is decided by something less rigorous than a quorum.

The mitigations

A small set of practices prevents split brain reliably.

Use a real consensus system for cluster membership. Raft (etcd, Consul, ZooKeeper’s ZAB variant) is the modern default. The cluster’s membership and leadership are managed by a consensus protocol, with majority quorum, with monotonic terms. Do not invent your own.

Run odd-numbered clusters of three or more. Five is better than three for production. Two is worse than one. Four is worse than three.

Implement fencing tokens at the boundary. When the leader writes to external storage or an external service, attach the token. When the storage or service receives the write, check the token against the highest seen.

Use STONITH for legacy HA. Shoot The Other Node In The Head: when a node loses its lease, hardware-level fencing forcibly powers it off so it cannot accept writes during the gap. This is the older, brutal, and effective answer for high-availability clusters built on shared storage.

Do not run anything important on a two-node cluster. Either accept single-node operation or move to three.

What this sets up

Split brain is the headline failure mode of distributed systems: the one with the worst consequences and the one most often misunderstood. The defenses (quorum, fencing, sane cluster sizes) are all well-known, but every year teams ship systems that lack them and pay the bill when their network has a bad day. The Jepsen test suite, run by Kyle Kingsbury (Aphyr), has spent over a decade publishing detailed reports of distributed databases failing under partition; many of the failures reduce to “the system did not actually have quorum where it claimed to.”

This lesson covered the dramatic failure mode. The next lesson covers the more common day-to-day pain: cross-shard queries. Most workloads do not see split brain; they see “this query has to talk to four shards and one of them is slow” every minute of every day. The strategies for living with that ordinary reality are different, and that is where lesson 31 picks up.

Citations and further reading

Martin Kleppmann, “Designing Data-Intensive Applications” (O’Reilly, 2017), Chapter 9 (consistency and consensus). The book-length treatment of split brain, fencing, and quorum.
Kyle Kingsbury (Aphyr), “Jepsen”, https://jepsen.io/analyses (retrieved 2026-05-01). A decade of analyses of distributed databases failing under partition. Pick almost any report for a concrete case study.
Diego Ongaro and John Ousterhout, “In Search of an Understandable Consensus Algorithm”, USENIX ATC 2014, https://raft.github.io/raft.pdf (retrieved 2026-05-01). The Raft paper, which introduces terms (the fencing tokens of the protocol) explicitly.
Pacemaker documentation, “Fencing and STONITH”, https://clusterlabs.org/pacemaker/doc/ (retrieved 2026-05-01). The legacy HA reference for hardware fencing.
Lesson 14 of this series (consensus, Paxos, and Raft) and lesson 16 (delivery semantics and idempotency) for the building blocks this lesson assumes.