spark.sql.shuffle.partitions = 200 and why it's almost always wrong

Of all the configuration settings Spark ships with, exactly one is responsible for more slow jobs than the rest combined: spark.sql.shuffle.partitions. It’s set to 200 by default. It was a sensible default in 2014 when the typical Spark cluster was four laptops in a co-working space. On a modern cluster running modern data volumes, it’s almost always wrong — sometimes off by a factor of 10, sometimes by a factor of 100.

This lesson is about what this knob does, why the default doesn’t fit your job, and the math for picking a value that does.

What it controls

spark.sql.shuffle.partitions controls the number of output partitions after every shuffle. Group-by, join, distinct, window function, repartition — any operation that triggers a shuffle ends up writing exactly this many partitions to the next stage’s input.

spark.conf.get("spark.sql.shuffle.partitions")
# '200'

df.groupBy("country").count()  # output has 200 partitions
df.join(other, "id")           # output has 200 partitions
df.dropDuplicates()            # output has 200 partitions

Every shuffle. Every time. Regardless of input size, output size, cluster shape, or anything else about the job. The default is just the integer literal 200, frozen in time.

It’s worth pausing on the size of this hammer. The shuffle is the most expensive thing Spark does. The partition count after the shuffle determines the parallelism of every downstream stage. Setting this number wrong is the difference between a 3-minute job and a 3-hour job — and in production, between a job that fits in its window and one that pages someone at 4am.

Why 200 is wrong on a modern cluster

Two scenarios. Pick whichever feels more like yours.

Scenario 1: a real cluster, modest data

Imagine 50 executors with 4 cores each — 200 cores total. After a shuffle, you have 200 output partitions. Spark generates 200 tasks. The cluster runs all 200 of them simultaneously, and… everything works.

Sort of. Now suppose the post-shuffle data is 100 GB total. With 200 partitions, each task processes 500 MB. A task processing 500 MB on a JVM with reasonable memory: usually fine, finishes in a minute, no spill.

So far the default works. This is the case Spark’s default was tuned for, and it’s the reason most tutorial workloads never expose its weakness.

Scenario 2: same cluster, bigger data

Same 200 cores. But now post-shuffle data is 10 TB instead of 100 GB. With 200 partitions, each task processes 50 GB. A single task trying to sort, hash-join, or aggregate 50 GB of rows on one core:

Spills to disk repeatedly because the task can’t fit 50 GB of state in memory.
Spill writes are slow disk I/O. Now the task is I/O bound.
The shuffle write phase produces 50 GB of output for one task, which is also slow.
If anything skews on top of this, one task gets 100 GB and runs for hours.

The job that “should” finish in 20 minutes runs for 4 hours. CPU dashboards show low utilization because most of the time is spent spilling and reading from disk. Every increase in cores makes no difference because there are only 200 tasks.

The cure is more partitions. Not 2x more — 50x more. With spark.sql.shuffle.partitions = 10000 and the same 10 TB, each task processes 1 GB. Tasks finish in a couple of minutes each, all 200 cores stay busy across many waves, total wall clock drops dramatically.

The flip side also exists. If your post-shuffle data is 1 GB total and you keep 200 partitions, each task does 5 MB of work — finishes in 50 ms, and the per-task overhead (scheduling, classloading, shuffle bookkeeping) dominates. Better to coalesce down to 16 partitions and let each task chew through 64 MB.

In short: 200 is right for one specific scale of data and cluster. Anything outside that scale, the number is wrong.

The tuning rule

A workable rule of thumb, used widely in practice and recommended by Databricks:

Aim for 100-200 MB per shuffle partition.

That sizes a task’s working set to fit comfortably in executor memory, finish in a reasonable wall clock, and not waste time on per-task overhead. The math:

shuffle_partitions ≈ total_shuffle_data_GB × 1024 / target_MB_per_partition

If you’re shuffling 500 GB of data and targeting 128 MB per partition:

shuffle_partitions ≈ 500 × 1024 / 128 ≈ 4000

Round to a clean multiple of your core count if you want — 4000 works fine, 4096 is slightly nicer if you have 256 cores because every task lands evenly across waves. Don’t agonize over the exact number; an order-of-magnitude correct value beats the default by miles.

The tricky part is knowing the post-shuffle data size. Three ways to estimate:

From the input. If you’re reading 500 GB of input and the shuffle is roughly the same size (a join that doesn’t filter much), assume 500 GB.

From a small run. Run the job on a sample, look at the Spark UI, click into the shuffle stage, read the “Shuffle Write” total. Multiply by 1 / sample_fraction.

From experience. After a few jobs you’ll have a sense of which queries blow up post-shuffle and which shrink. A groupBy("user_id").agg(...) typically shrinks the data 100-1000x. A self-join typically grows it 10-100x. A window function over the full table preserves the row count exactly.

Setting it

Three places, in increasing order of scope:

Per job, in code:

spark.conf.set("spark.sql.shuffle.partitions", "4000")

This is what you usually want. Tune it for the specific job. Different jobs have wildly different shuffle sizes, and a single global number can’t be right for all of them.

At session start:

spark = (SparkSession.builder
         .appName("BigJoin")
         .config("spark.sql.shuffle.partitions", "4000")
         .getOrCreate())

Same effect, set declaratively at the top.

In the cluster default (spark-defaults.conf):

spark.sql.shuffle.partitions  4000

Sets the floor for everything running on the cluster. Useful when the cluster has a typical workload size and you want to override the 200 default once. Still doesn’t replace per-job tuning for outliers.

You can also read the current value back to confirm:

print(spark.conf.get("spark.sql.shuffle.partitions"))

Useful in the first lines of a job for traceability — if a future you opens the logs and wonders what the shuffle partitions were, you’ll be glad you logged it.

Three quick worked examples

Small workload — daily ETL on 5 GB of data.

Post-shuffle: ~5 GB. Target: 100 MB per partition. Math: 5 × 1024 / 100 ≈ 50.

Set spark.sql.shuffle.partitions = 64. The default 200 would give you tasks of 25 MB each — fine, but more overhead than necessary. 64 is enough parallelism for any reasonable cluster and keeps tasks at a healthy 80 MB.

Medium workload — hourly aggregation on 200 GB.

Post-shuffle (after group-by, often shrinks): assume 50 GB. Target: 128 MB. Math: 50 × 1024 / 128 ≈ 400.

Set spark.sql.shuffle.partitions = 400. The default 200 would give 250 MB per task — borderline, possibly OK, but cutting close to spill territory. 400 is comfortable.

Large workload — joining two 5 TB tables.

Post-shuffle: 5 TB on each side, full sort-merge join. Target: 200 MB. Math: 5000 × 1024 / 200 ≈ 25600.

Set spark.sql.shuffle.partitions = 25000. The default 200 would give you 25 GB per task — guaranteed catastrophe. This is the workload class where leaving the default in place turns a tractable job into an outage.

Other knobs in the same neighborhood

A few related settings you’ll see referenced together:

spark.default.parallelism — the RDD-API equivalent. Doesn’t affect DataFrames. Mostly historical.
spark.sql.files.maxPartitionBytes — controls the size of partitions on read (default 128 MB). Different concept, different stage. Lesson 33 covers this.
spark.sql.adaptive.coalescePartitions.enabled — part of AQE, see below.

spark.sql.shuffle.partitions is the one that matters for shuffle stages, which is most of where time is spent.

The AQE upgrade

In Spark 3.x, Adaptive Query Execution (AQE) can dynamically adjust shuffle partitions at runtime. With:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

…Spark looks at the actual size of each shuffle output and coalesces small partitions together post-shuffle, so you don’t end up with thousands of tiny tasks if your data turns out to be smaller than expected. You can set spark.sql.shuffle.partitions high (say 5000) and let AQE collapse it down to the right number for the actual data. AQE also handles skew (lesson 29) and switches join strategies on the fly when statistics indicate it should.

This significantly reduces the cost of getting spark.sql.shuffle.partitions wrong on the high side. If your data turns out smaller than expected, AQE quietly fixes it. But it doesn’t fix the low side — if you set 200 and your data is 10 TB, AQE has nothing to coalesce; the partitions are already too few. Picking a reasonable starting value still matters.

Lesson 59 goes deep on AQE — the four optimizations it does, when each one kicks in, what to log to confirm it worked, and where its limits are. For now: turn it on, set a reasonable starting value for shuffle partitions, and let it adjust.

What to actually do

A short checklist for any new Spark job:

Estimate post-shuffle data size. Even a rough order of magnitude is enough.
Compute the partition count. data_GB × 1024 / 128 MB, round to something clean.
Set spark.sql.shuffle.partitions at the top of the job. Don’t rely on the default. Don’t rely on a cluster-level config you can’t see.
Enable AQE. Both adaptive.enabled and adaptive.coalescePartitions.enabled. Free safety net.
Watch the Spark UI on the first run. If the shuffle stage has tasks running 4 minutes each, halve the partition count next time. If they’re running 4 seconds each, double it. Iterate until tasks are in the tens-of-seconds range and the cluster is busy.

The next two lessons go into repartition and coalesce — the explicit operators for changing partition counts mid-job — and the difference between the two, which is also a common source of silent slowness.

The single biggest win you can get from Module 6, though, is just this: stop letting spark.sql.shuffle.partitions = 200 be the answer for every job. Set it deliberately. Future you will save hours, and your cluster bill will drop noticeably.

References: Apache Spark documentation on SQL configuration and Adaptive Query Execution; Databricks tuning guide on shuffle partitions and partition sizing. Retrieved 2026-05-01.