The previous two lessons set up the world: a partition is a chunk of rows assigned to one task, and Spark cares deeply about how many of them you have and how evenly they’re packed. This lesson is about the two operators that let you change that on purpose: repartition and coalesce.
They look interchangeable. They’re not. One is a sledgehammer that triggers a full shuffle and gives you exactly what you asked for. The other is a screwdriver that combines partitions in place — cheap, but with a sharp edge that can quietly serialize your entire upstream pipeline to a single task. People reach for coalesce(1) to “just write one file” and wonder why the job that used to finish in two minutes now takes forty.
This lesson explains both, when to use which, and the gotcha that gets everyone the first time.
repartition(N): the full shuffle
df.repartition(N) rebuilds the DataFrame into exactly N partitions, evenly distributed. It does this with a full shuffle: every row goes through the network (or local disk in single-machine mode), gets hashed into a target bucket, and lands in a new partition.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = (SparkSession.builder
.appName("RepartitionVsCoalesce")
.master("local[*]")
.getOrCreate())
events = spark.range(0, 1_000_000).select(
F.col("id").alias("event_id"),
(F.col("id") % 1000).alias("user_id"),
F.lit("data").alias("payload"),
)
print(events.rdd.getNumPartitions()) # likely 8 on a local[*] machine
big = events.repartition(50)
print(big.rdd.getNumPartitions()) # 50
Two things to notice. First, repartition works in either direction — you can go from 8 to 50, or from 200 down to 10. Second, the resulting partitions are uniform: roughly the same number of rows in each, regardless of how unbalanced the input was. That uniformity is the whole reason repartition is expensive — to get it, every row has to be re-routed through a shuffle.
Look at the plan:
big.explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(50), REPARTITION_BY_NUM, [plan_id=...]
# +- Range (0, 1000000, ...)
Exchange is the Spark word for “shuffle here.” RoundRobinPartitioning(50) means rows are dealt out across 50 buckets in round-robin fashion — no key, just spread them evenly. That’s the default for repartition(N).
repartition(N, *keys): hash-partition by a column
There’s an overload that takes columns:
hashed = events.repartition(50, "user_id")
hashed.explain()
# == Physical Plan ==
# Exchange hashpartitioning(user_id#3L, 50), REPARTITION_BY_NUM, ...
# +- Range ...
Now Spark hashes each row’s user_id, mods by 50, and uses that as the partition number. Same total work as repartition(50) — still a full shuffle — but the rows are now grouped: every event for user_id = 42 lands in the same partition.
This is useful right before a join or a window function on that key. Spark would shuffle anyway to colocate matching keys; if you do it explicitly first, you control the partition count and you can cache the result and reuse it across queries. Lesson 36 will show how the same idea, persisted to disk, becomes bucketing.
A trap: hash-partitioning by a skewed key reproduces the skew exactly. If user_id = 1 has a million rows and the median user has a hundred, repartition(50, "user_id") will hash every row for user 1 into a single partition. Lesson 28’s skew diagnosis applies here — repartition(N, key) is not a skew fix, it’s a partition-count-and-key-distribution control.
repartitionByRange(N, *keys): ordered output
Less commonly used, useful when you want roughly-sorted output:
ranged = events.repartitionByRange(50, "event_id")
Spark samples the data, picks 50 range boundaries on event_id, and routes rows to whichever range bucket they fall into. The result is partitions that are individually sortable and globally roughly ordered. This is what df.sort() uses under the hood when you sort across the whole DataFrame. Use it when you want to write Parquet files where each file covers a contiguous range — handy for partition pruning when the storage layer doesn’t support partitionBy on continuous columns.
coalesce(N): the cheap shrink
df.coalesce(N) is the other tool. It does not shuffle. Instead, it merges existing partitions in place:
shrunk = events.coalesce(2)
print(shrunk.rdd.getNumPartitions()) # 2
shrunk.explain()
# == Physical Plan ==
# Coalesce 2
# +- Range (0, 1000000, ...)
No Exchange in the plan. Spark just decides that, instead of running 8 tasks each on their own partition, it’ll run 2 tasks where each task reads from 4 of the original partitions in sequence. The data didn’t move; the boundaries did.
This is much cheaper than repartition. It’s also weaker:
coalescecan only reduce partition count. Asking forcoalesce(50)on an 8-partition DataFrame gives you 8 partitions back, no error. To increase, you needrepartition.- The resulting partitions can be uneven. If your 8 input partitions were unbalanced,
coalesce(2)produces 2 partitions whose sizes are the sum of whichever 4 originals got merged together. No rebalancing happens. - And the big one — the upstream gotcha.
The coalesce(1) upstream gotcha
This is the trap. df.coalesce(1) doesn’t just put the final result into one partition — it pushes that constraint up the DAG, often making the entire upstream pipeline single-threaded.
Consider:
result = (spark.read.parquet("/data/big-input") # 200 partitions
.filter(F.col("country") == "IT")
.withColumn("year", F.year("dt"))
.groupBy("year").count()
.coalesce(1) # because we want 1 file
.write.parquet("/data/output"))
Intuition says: read 200 partitions in parallel, filter, group, then merge the result into one file at the end. What actually happens: Spark sees coalesce(1) and propagates the parallelism backward as far as it can without crossing a shuffle boundary. The filter, the withColumn, and the read all run with parallelism 1. Two hundred input partitions are read serially by a single task. The job that used to take two minutes now takes forty.
The reason: coalesce is a narrow transformation (lesson 21). Spark doesn’t insert a shuffle to satisfy it; instead it absorbs the new partition count into the previous stage. If you coalesce(1), the previous stage now runs with one task. If there’s a shuffle further upstream — like the groupBy in the example — the back-propagation stops there, because the shuffle is a stage boundary. So in this case the groupBy and earlier might still parallelize, depending on how Spark plans the stages, but the post-groupBy work is single-threaded.
The fix is to use repartition(1) instead of coalesce(1) whenever the cost of a final shuffle is cheaper than the cost of running upstream serially:
result = (spark.read.parquet("/data/big-input")
.filter(F.col("country") == "IT")
.withColumn("year", F.year("dt"))
.groupBy("year").count()
.repartition(1) # full shuffle, but upstream stays parallel
.write.parquet("/data/output"))
Now the read, filter, and groupBy all use as much parallelism as Spark wants, and the final shuffle to one partition is a one-time cost on a tiny aggregated result. Total wall clock: back to two minutes.
The rule of thumb: coalesce(N) is safe when N is close to the existing partition count and N is large enough that the upstream stage is still happy with that level of parallelism. coalesce(1) is almost never what you want unless the upstream is already cheap.
Worked example: the small-files problem
A common reason to change partition count is the small-files problem at write time. Imagine a job whose final stage has 5,000 partitions because of an upstream shuffle. If you write that out as Parquet, you get 5,000 tiny files — terrible for downstream readers, slow to list, and a lot of S3 PUT costs.
You want maybe 50 files of reasonable size. Two options:
# Option A: coalesce — cheap, but watch upstream
final.coalesce(50).write.parquet("/data/out")
# Option B: repartition — more expensive, but uniform and parallelism-safe
final.repartition(50).write.parquet("/data/out")
Option A skips the shuffle. Tasks merge 100 input partitions each (5,000 / 50). The previous stage runs with 50 tasks instead of 5,000 — usually still fine, sometimes a problem if the upstream was already CPU-bound and 50 isn’t enough cores worth of parallelism.
Option B does a full shuffle. More expensive, but the upstream stage runs with its original 5,000 partitions of parallelism, and the output partitions are uniform.
For “shrink from N to N/100” the answer is usually coalesce. For “shrink from N to a single file or two” the answer is almost always repartition. For “shrink from N to N/10 but I’m worried about the upstream stage being slow” — measure both. Spark 3.x’s AQE can also coalesce shuffle partitions automatically, which often removes the need to do this by hand. Lesson 59 covers AQE.
Quick reference
| Want this | Use | Triggers shuffle? |
|---|---|---|
| Exactly N uniform partitions | repartition(N) | Yes |
| N partitions hashed by a key | repartition(N, "key") | Yes |
| Roughly-sorted N partitions | repartitionByRange(N, "key") | Yes |
| Reduce partition count cheaply | coalesce(N) | No |
| One output file, big upstream | repartition(1) | Yes |
| One output file, tiny upstream | coalesce(1) | No |
That repartition(1) row is the one most people get wrong on first contact.
Run this on your own machine
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = (SparkSession.builder
.appName("RepartitionDemo")
.master("local[*]")
.getOrCreate())
events = spark.range(0, 1_000_000).select(
F.col("id").alias("event_id"),
(F.col("id") % 1000).alias("user_id"),
)
print("default:", events.rdd.getNumPartitions())
# repartition up
up = events.repartition(50)
print("after repartition(50):", up.rdd.getNumPartitions())
up.explain()
# repartition by key
hashed = events.repartition(20, "user_id")
print("after repartition(20, user_id):", hashed.rdd.getNumPartitions())
hashed.explain()
# coalesce down — no shuffle
down = events.coalesce(2)
print("after coalesce(2):", down.rdd.getNumPartitions())
down.explain()
# coalesce up doesn't actually go up
no_op = events.coalesce(50)
print("after coalesce(50):", no_op.rdd.getNumPartitions()) # still 8
# Simulate the small-files write
(events.repartition(10)
.write.mode("overwrite").parquet("/tmp/repartition-out"))
Run each .explain() and look for the Exchange line. repartition always has one. coalesce never does. That’s the difference, in one word.
Next lesson: partitioned writes — partitionBy on disk, the directory layout it produces, and how Spark uses it to skip files at read time. Plus the cardinality trap that makes partitionBy("user_id") worse than no partitioning at all.
References: Apache Spark SQL documentation (https://spark.apache.org/docs/latest/sql-data-sources.html) and Databricks engineering posts on partition tuning. Retrieved 2026-05-01.