Cross-shard queries: fan-out vs co-location

The previous three lessons walked through the mechanics of partitioning a database across many machines. Hash sharding, range sharding, the rebalancing problem, the hot-key problem. By the end of lesson 30 you had a sharded cluster and a defensible shard key. This lesson is about the consequence: now that your data is spread across N machines, every query you run has a cost that depends on how many shards it has to touch, and the entire architecture is built around keeping that number as small as possible. Usually one.

The basic fact, stated bluntly: a sharded database is fast on one-shard queries and slow on N-shard queries. A query that lands on a single shard runs at single-shard speed, which on Postgres or MySQL is a millisecond or two for a primary-key lookup. A query that has to fan out to all shards runs at the speed of the slowest shard, plus the overhead of merging the results. If the shards are on the same network, that fan-out is single-digit milliseconds. If they are across regions, it is hundreds. Either way, it scales worse than a one-shard query, and as the cluster grows it gets worse, not better.

The whole game of designing a sharded system is keeping queries on one shard whenever you possibly can. The two strategies for doing that are co-location and fan-out, and they are complementary rather than competing.

Co-location: arrange data so the query stays on one shard

Co-location is the proactive strategy. You arrange your data layout so that any rows a typical query needs to look at together live on the same shard. The shard key is the lever. Pick it so that the entities that are queried together share a key.

The classic and almost universal choice for SaaS applications is shard by user_id. A user’s profile, their orders, their settings, their sessions, their notifications, their uploaded files: all of it gets the user’s ID as part of the shard key, which means all of it lives on the same shard. Any query that filters by user_id (which is almost every query in a user-centric app) hits exactly one shard.

This works because user-centric workloads have a property that lines up perfectly with the sharding model. Most queries are scoped to a single user. My orders are independent of yours: there is no normal query that needs to join my orders to your orders. The data has a natural partitioning along the user_id axis, and the application’s access pattern respects it.

The same logic generalises. A multi-tenant B2B SaaS shards by tenant_id rather than user_id, because the natural unit of locality is the tenant, not the individual user. A messaging app shards by channel_id (the Discord case study in the next lesson is exactly this). A social game shards by world_id or instance_id. The pattern is the same: identify the entity that owns most of the data and most of the queries, and use its ID as the shard key.

When you get the shard key right, the system feels almost like a non-sharded database. Queries are fast, latency is predictable, the cluster scales linearly with the number of users (or tenants, or channels). The fact that there are fifty machines under the hood is invisible to the application code, which just filters by user_id and gets one-shard performance every time.

Fan-out: when you genuinely cannot avoid touching every shard

Some queries cannot be made to live on one shard, no matter how clever your shard key is. “How many users have logged in today across the entire platform” cannot be answered from one shard, because logins are spread across all of them by user_id. “Show me all orders placed in the last hour, sorted by value” cannot be answered from one shard for the same reason. Anything that aggregates across users, or filters by a non-shard-key field, or does global ordering, has to look at every shard.

For these queries the strategy is fan-out. The query coordinator (sometimes the application, sometimes a middle-tier router, sometimes the database itself) sends the query to every shard in parallel. Each shard executes its part locally and returns its result. The coordinator merges the partial results into the final answer.

The cost model for fan-out is straightforward. The latency is the latency of the slowest shard plus the merge cost. Throughput is divided by N, because each query now consumes a slot on every shard rather than just one. If you do too many fan-out queries, the cluster is effectively running at 1/N of its nominal capacity. This is why fan-out has to be the exception, not the rule.

When the partial results are large (a global sort on millions of rows), the merge step itself becomes expensive. Sometimes the merge is offloaded to a separate machine. Sometimes the application accepts a top-K approximation rather than a true global sort. Sometimes the query is rewritten to use a different store entirely, which is the path to the analytics warehouse covered in lesson 65.

flowchart LR
    Q1[Query: orders for user 42] --> R1[Router]
    R1 -->|hash user_id 42| S1[(Shard 7)]
    S1 --> Resp1[1ms response]

    Q2[Query: top 100 orders today] --> R2[Router]
    R2 --> SA[(Shard 1)]
    R2 --> SB[(Shard 2)]
    R2 --> SC[(Shard ...)]
    R2 --> SD[(Shard N)]
    SA --> M[Merge top-K]
    SB --> M
    SC --> M
    SD --> M
    M --> Resp2[N-shard latency + merge]

The two query patterns side by side. The one-shard query is a single hop. The fan-out query is N hops in parallel followed by a merge. The cost difference is what makes the shard-key choice the most consequential decision in the system.

Why “shard by user_id” became the religion

Three reasons, all worth saying out loud because the choice is so reflexive in modern SaaS that engineers often forget there was a decision at all.

User-centric workloads have user-scoped queries. The vast majority of queries in a SaaS application start with “for this user” or “in this tenant”. When the data is laid out so that each user’s data lives on one shard, those queries hit one shard. The architecture compounds with the workload.

User entities are independent. My data does not have to be joined to your data for any normal feature. A user’s orders, profile, and history are self-contained. There is no inherent relational integrity that crosses the user boundary, so the loss of cross-user joins is not painful.

Reporting can be moved offline. The queries that genuinely need a global view (analytics dashboards, business intelligence, fraud detection across users) are not latency-critical. They can be run against a separate analytics store that ingests data from all shards through change-data-capture. The transactional sharded cluster is left alone to do its one-shard-query job. Lesson 65 covers this pattern in detail.

The combination means that for a user-centric SaaS, sharding by user_id is almost free. Almost every query is naturally on one shard, the few that are not can be moved to an analytics warehouse, and the cluster scales to the number of users without architectural surprises.

When user_id sharding does not work

Three situations, all worth recognising.

Multi-tenant data with cross-user views inside a tenant. A B2B SaaS where one tenant has many users, and an admin in that tenant wants to see all activity from all their users, will be unhappy with user_id sharding. The admin’s query is naturally scoped to a tenant, but the data is spread across many shards (one per user). The fix is to shard by tenant_id instead. Now an admin query is one shard, and a user query is also one shard (because the user’s data lives on the tenant’s shard). The trade is that a very large tenant becomes a hot partition, which is the problem lesson 28 covered.

Cross-user features. Chat between two users. A friend graph. A like on someone else’s post. These features inherently cross the user boundary, and no shard key keeps them on one shard. The pragmatic answers: store messages twice, once on each participant’s shard, so any query about “my messages” stays on my shard. Or store the cross-user data in a separate, smaller, non-sharded store dedicated to the relationship table. Or use event-driven materialised views that pre-compute the cross-user joins into per-user inboxes. All three are common; none is free.

Global lookups by something other than user_id. “Find the user with email alice@example.com” is a global query when sharded by user_id, because the email lives on whatever shard owns that user, and you do not know which one without a separate index. The standard fix is a small global lookup table (email -> user_id) maintained by every shard write, or kept in a non-sharded auxiliary store. The lookup table is small enough to live on one machine, the rest of the data is sharded for scale.

The general principle: when you identify a query that does not fit the shard key, you have three honest options. Replicate the small reference data to every shard so any shard can answer locally. Build a denormalised secondary index (often in a separate store) so the alternate access path is also a one-shard lookup. Accept fan-out for that specific query and ensure it is rare enough that the cluster can absorb it.

Replicating reference tables

A pattern worth naming. Some tables are small, change rarely, and are joined by every query: country codes, currency rates, feature-flag definitions, product categories, plan limits. If you shard a table like this, every join becomes a fan-out, which ruins everything.

The fix is to replicate the table to every shard. Every shard holds a full copy. Joins are local. Updates are written to every shard, which is acceptable because they are rare. Postgres-flavour systems sometimes have explicit support for this: Citus calls them “reference tables”, Vitess has a similar concept. In a hand-rolled sharded application, the migration system pushes schema and data changes to all shards.

Materialised views for the awkward query

The other escape valve. When a query genuinely cannot live on one shard with the natural data layout, you precompute the answer into a structure that does.

The classic example is a feed. User A’s feed combines posts from users B, C, D, E. If the data is sharded by user_id, building A’s feed requires fan-out across however many users A follows. The fix is to maintain A’s feed as its own table, sharded by A’s user_id. When B posts, an event writes a denormalised copy into the feed tables of everyone who follows B. The feed-build cost is paid at write time, distributed and amortised. The read is a one-shard lookup.

This is the fan-out-on-write pattern, and it is what every large-scale social product uses for feeds. Twitter has been public about doing exactly this. The cost is real: B’s post might be written to a million followers’ feed tables. The compensation is constant read latency regardless of how many users B has. Module 5 covers the event-driven plumbing that makes this operational.

The performance numbers, concretely

A one-shard query on Postgres against a primary key index: 0.5 to 2 ms, including the local network round-trip. The routing layer adds at most a millisecond on top.

A 10-shard fan-out on the same network: roughly 5 to 10 ms, dominated by the slowest shard plus a millisecond of merge. Acceptable occasionally, expensive on every page view. A 100-shard fan-out: 10 to 30 ms in the best case, much worse if any single shard is slow.

A cross-region fan-out, where shards live in different geographic regions: 100 to 300 ms minimum, bounded by inter-region network latency. This is almost always too slow for an interactive query. The standard answer is geo-sharding by region key and accepting that global queries are not interactive. These numbers are the reason the architecture is built around one-shard queries.

The shape of a healthy sharded application

To close: what does it look like when sharding is done right?

The shard key matches the dominant query pattern. Almost every hot-path query filters by the shard key explicitly, so the routing layer can forward to the target shard directly. A small number of reference tables are replicated to every shard for local joins. A small number of cross-shard queries exist; they are either rare enough to fan out, or moved to materialised views that turn them back into one-shard reads, or moved to an analytics store that ingests CDC from all shards.

When the application grows, more shards are added. Hot-path queries continue to be one-shard. The cluster’s capacity grows linearly with the number of shards. This is what people mean when they say “sharding works”. It is the result of getting the shard key right and treating cross-shard queries as the rare exception they have to be.

Lesson 32, the final lesson of Module 4, walks through Discord’s storage migration in detail. It is the textbook example of a system whose architecture was built around one query pattern and one shard key, and whose migrations across three different databases all preserved that same shape.

Citations and further reading

“Designing Data-Intensive Applications” (Martin Kleppmann, O’Reilly, 2017), chapter 6. The standard reference for partitioning, shard keys, and cross-partition queries.
Citus documentation, “Reference Tables”, https://docs.citusdata.com/en/stable/develop/reference_tables.html (retrieved 2026-05-01). The operational treatment of replicated reference tables in a Postgres-sharding product.
Vitess documentation, https://vitess.io/docs/ (retrieved 2026-05-01). MySQL sharding with explicit support for the patterns this lesson discusses.
Twitter Engineering, “The Infrastructure Behind Twitter: Scale”, https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale (retrieved 2026-05-01). Public discussion of fan-out-on-write for feed delivery at scale.