Data & System Architecture, from the ground up Lesson 27 / 80

Partitioning: by key, by hash, by range

When one node can't hold the data, you split it. The three partitioning strategies and the queries each enables.

The previous two lessons were about replication: keeping more than one copy of the same data on more than one machine, so that a single failure does not lose the data and a single server’s capacity is not the read ceiling. Replication is for redundancy. It does not, by itself, let you store more data than fits on one machine, and it does not let you handle a write workload bigger than the leader can sustain. For those problems the technique is partitioning: splitting the data so that each machine holds a different subset.

The two are orthogonal, and most large databases do both at once. Each partition is replicated for durability and read scaling; each replica set is one of many partitions covering different parts of the data. Cassandra, DynamoDB, MongoDB sharded clusters, Elasticsearch, and most other distributed stores combine the two patterns this way. This lesson covers the partitioning side of that picture, on its own. Combine it with the previous lessons and you have the full view.

The decision a partitioning scheme has to make is, simply: given a piece of data, which node holds it? The data itself does not change; only its location does. Three families of strategy answer that question, and each one makes the same trade-off in a slightly different way: it optimises for one query shape and pays for it on another.

Range-based partitioning

The data is sorted by a key. The key space is divided into contiguous ranges, and each range goes to a partition. Partition 1 might cover keys A through G; partition 2 covers H through N; partition 3 covers O through Z. A row with key “Marco” lives on partition 2.

The strategy fits naturally onto a sorted structure, so it shows up in databases that already organise data by sorted keys. HBase, BigTable, and the original LevelDB-derived stores work this way. MongoDB sharded clusters can use range-based partitioning when the shard key is configured for it. Many time-series databases partition by time range, which is just range partitioning where the key happens to be a timestamp.

The advantage is range queries. If your application asks “give me all keys between J and L”, the database knows exactly which partition holds that range, sends the query there, and returns the result without touching any other partition. Time-bounded queries on a time-partitioned store are the canonical example: “give me all events from yesterday” hits one or two partitions, never more.

The disadvantage is hot partitions for skewed data. If your key is something like a user’s last name, the alphabet is not uniformly distributed: vastly more people have surnames starting with letters in the middle than letters near the end. The middle partition gets disproportionate traffic. If your key is a timestamp and writes are real-time, every new write goes to the most-recent partition, and that one partition takes the entire write load while older partitions sit idle. Range partitioning is excellent for the workloads it fits and bad for the workloads it does not.

The partition boundaries are usually adaptive: the database splits a partition when it grows too large and merges adjacent partitions when they shrink. This keeps the partitions roughly balanced in storage size, but it does nothing about access skew. A small partition that is hot is still hot.

Hash-based partitioning

Instead of partitioning by the key directly, partition by the hash of the key. The key “Marco” is hashed to some integer; the hash space is divided into ranges; each range goes to a partition. The partition for “Marco” is determined by where its hash falls, not where its alphabetical position falls.

This is the default in Cassandra, in DynamoDB, in most distributed key-value stores, and in MongoDB sharded clusters when the shard key is configured for hashing. Postgres has had hash partitioning natively since version 11.

The advantage is even distribution. Assuming the hash function is good, the keys are spread uniformly across the hash space regardless of how skewed the original keys are. The “everyone has surnames starting with M” problem disappears: the hashes of similar keys are scattered across the partition space, so no single partition becomes hot from key clustering alone. Write load and storage are balanced by design.

The disadvantage is the loss of range queries. The hash function shuffles keys, so consecutive keys in the original space land on different partitions. “Give me all keys between J and L” becomes a fan-out: the database has no way to know which partition holds which key without computing the hash, so it has to query every partition and merge the results. For workloads that do range queries on the partitioning key, hash partitioning is the wrong tool.

In practice, many systems support range queries on a secondary index even when the partitioning is hash-based, but those queries are still fan-outs across all partitions. The fan-out cost is the central trade-off.

A subtle related pattern is compound partitioning: hash on the first part of the key, sort on the second. Cassandra calls this “partition key plus clustering key”. The hash on the partition key spreads data evenly, and within each partition, rows are sorted by the clustering key, so range queries within a single partition are efficient. The query “all messages from user X between time T1 and T2” hits one partition (because the user ID is the partition key) and returns a sorted range within it. This is the workhorse pattern for time-series-on-Cassandra and for chat-message stores.

Consistent hashing and the by-key view

The third pattern is closely related to hash-based partitioning but framed differently. Imagine a circular hash space, the hash ring. Each node in the cluster is assigned one or more positions on the ring (often called virtual nodes or vnodes). A key, when hashed, lands at some point on the ring; the node that owns it is the next one clockwise from that point.

This is consistent hashing, and the property that makes it useful is its behaviour when nodes join or leave. Adding a node assigns it a new set of positions on the ring; only the keys that fall between the new node’s predecessor and the new node have to move. Removing a node hands its keys to the next node clockwise. In both cases, only a fraction of the keys (proportional to the number of nodes) is rebalanced, instead of the catastrophic re-hashing that naive modulo-based partitioning would require.

flowchart TB
    subgraph H[Hash partitioning]
        K1[key X] --> Hash[hash function]
        Hash --> P1[Partition 1]
        Hash --> P2[Partition 2]
        Hash --> P3[Partition 3]
        P1 --> Note1[Even distribution, fan-out range queries]
    end
    subgraph R[Range partitioning]
        K2[key X] --> Lookup[range table]
        Lookup --> P4[Partition A-G]
        Lookup --> P5[Partition H-N]
        Lookup --> P6[Partition O-Z]
        P4 --> Note2[Range queries fast, hot partitions on skew]
    end

Consistent hashing is the partitioning mechanism in Cassandra, DynamoDB, Riak, the distributed memcached/Redis cluster modes, and most peer-to-peer systems. The original 1997 paper from Karger et al. was about distributed caches, but the technique generalised to nearly every modern distributed key-value store.

The choice of partition key

The strategy matters less than the partition key. A bad key on a good strategy creates more pain than a good key on a mediocre strategy. The two failure modes worth naming.

Hot partitions from key skew. A partition that holds a single popular value, or a small set of popular values, takes a disproportionate share of the traffic. The classic example is partitioning by user_id in a system where one user (a celebrity, a brand account, a bot) generates more traffic than thousands of normal users combined. The partition holding that user’s data is overloaded; the other partitions are underutilised. The same shape happens with viral posts, hot products on an e-commerce site, and active sessions on a chat application.

The fixes are workload-specific.

The first fix is a better partition key. If user_id alone creates hot partitions, use (user_id, time_bucket) or (user_id, message_id) as a compound key. Now a celebrity user’s data is spread across many partitions instead of one, at the cost of needing to query more partitions to read all of one user’s data.

The second fix is salting: prepend or append a random suffix to the partition key for hot values, so the same logical entity is split across multiple physical partitions. Reads have to query all the salted variants and merge, but writes are distributed. Cassandra’s documentation has detailed treatment of this pattern, and DynamoDB’s partition-key best-practices guide is largely about avoiding hot partitions and applying salting when avoidance fails.

The third fix is caching the hot data out of the partition. If the celebrity user’s profile is read a million times per minute, it does not need to live in the partitioned store on every read. Cache it in Redis or in an edge CDN, and the partition only sees the cache misses. This is often the cheapest fix.

Cross-partition queries. A query that needs data from many partitions has to fan out: query each partition, gather the results, merge them, possibly sort or aggregate. The latency of the query is bounded by the slowest partition, not the average, and the cost scales with the number of partitions touched. “Query the user’s own data” should be a one-partition query if the user ID is part of the partition key. “Aggregate metrics across all users” is unavoidably a fan-out. The first kind should be a millisecond-scale operation; the second kind is going to be hundreds of milliseconds at best, and in some systems that justifies pushing the aggregation to a separate analytics store entirely (the polyglot persistence argument from lesson 24).

The general principle: pick the partition key to align with the most common query pattern. If most reads are “get all data for user X”, make the user ID the partition key. If most reads are “get all events in time range T”, make the timestamp the partition key. If both patterns are common and they pull in different directions, you have a tension that no partitioning strategy will resolve neatly, and you may need a secondary system (a search index, an analytics warehouse) to serve the second query shape.

The rebalancing problem

When nodes join or leave, partitions have to move. A new node arrives empty and needs to take over its share of partitions. A failing node’s partitions need to be redistributed to the survivors. Done well, this is invisible to the application: the database moves data in the background, queries continue against the partitions they should hit, and the cluster reaches a new balanced state without anyone noticing.

Done badly, rebalancing is a multi-day operational event. The classical bad case is a partitioning scheme that does not support incremental rebalancing: adding one node forces every key to be re-mapped, every byte to be re-shipped, and the cluster spends the duration of the rebalance running at significantly reduced capacity. Consistent hashing was specifically invented to avoid this, and the fact that we still occasionally see the bad case in production is usually a sign of a misuse of the partitioning strategy or a poorly-managed migration.

The patterns to know.

Fixed number of partitions, far more partitions than nodes. Cassandra (with a high vnode count), Riak, and several others maintain a fixed number of logical partitions and assign them to physical nodes. Adding a node does not change the partition count, only the assignment, so only the keys in the moved partitions migrate.

Dynamic partitioning. HBase, BigTable, and MongoDB sharded clusters split partitions when they grow too large and merge them when they shrink. The partition count adapts to the data volume. Adding a node triggers redistribution, but the migration is done partition by partition, not key by key.

Per-node fixed share. Some systems assign a fixed share of partitions per node (1/N when there are N nodes). Adding a node forces 1/(N+1) of every existing node’s data to move. Simple to reason about, expensive when nodes join frequently.

The operational lesson is that rebalancing is the expensive operation in the lifecycle of a partitioned system, and the rate at which you can scale up or scale down is bounded by how fast the system can rebalance without saturating the network. If you expect to add capacity quickly, this is a property to test under load before you need it.

Replication and partitioning together

To close the loop with the previous two lessons: in a real production deployment, each partition is itself replicated. Cassandra’s standard configuration is N=3 replicas per partition, RF=3 across the cluster. DynamoDB replicates each partition across three availability zones. MongoDB sharded clusters wrap each shard in a replica set. The patterns combine as a two-dimensional grid: partitions across one axis, replicas across the other.

The consistency story is also two-dimensional. Replication has its own consistency mode (synchronous, asynchronous, quorum-based, leader-based). Partitioning has cross-partition transactional questions of its own (lesson 15’s two-phase commit, lesson 14’s consensus). A query that spans partitions and replicas is making consistency choices on both axes, and the database’s overall guarantees are the product of those choices.

The next lessons in this module pick up that thread. Lesson 28 takes the hot-partition problem deeper, into the patterns and anti-patterns that production systems use to manage skew. Later lessons turn to consistency across partitions: distributed transactions, Saga patterns, and the trade-offs that show up when a single user-visible operation has to touch data across multiple partitioned and replicated stores.

Citations and further reading

  • Martin Kleppmann, Designing Data-Intensive Applications (O’Reilly, 2017), Chapter 6. The reference treatment of partitioning, with the same hash-versus-range trade-off framing used here.
  • David Karger et al., “Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web”, STOC 1997. The original consistent-hashing paper, which is more readable than its title suggests.
  • AWS DynamoDB Developer Guide, “Designing Partition Keys to Distribute Workload Evenly”, https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html (retrieved 2026-05-01). Practical guidance on partition-key design, hot-partition avoidance, and write sharding.
  • Cassandra documentation, “Data Modeling: Partition Keys and Clustering”, https://cassandra.apache.org/doc/latest/cassandra/data_modeling/index.html (retrieved 2026-05-01). The compound-key pattern, with clustering keys for in-partition ordering.
  • MongoDB documentation, “Sharded Cluster Components” and “Choose a Shard Key”, https://www.mongodb.com/docs/manual/sharding/ (retrieved 2026-05-01). Range and hash sharding in a system that supports both.
  • Fay Chang et al., “Bigtable: A Distributed Storage System for Structured Data”, OSDI 2006. The canonical range-partitioned design.
Search