PySpark, from the ground up Lesson 31 / 60

What a partition is, physically

Partitions in memory, partitions on disk, and the relationship between partitions and tasks.

We’ve been throwing the word “partition” around for the entire course. Time to slow down and make sure it means the same thing in your head as it does in mine, because the word has two senses. They’re related but distinct, and confusing them is a common source of partition bugs.

This lesson is about what a partition is — physically, in memory and on disk — and what its relationship is to the unit of execution Spark actually runs, which is a task. Once that picture is clear, the next lesson (the spark.sql.shuffle.partitions knob and why the default is wrong) makes obvious sense.

The runtime partition

A partition, in the runtime sense, is the chunk of data Spark processes as a single unit of parallel work. One partition. One task. One core. At any given moment.

That’s the identity to internalize:

1 partition → 1 task → 1 core, for the duration of the task.

If your DataFrame has 1000 partitions and you trigger a stage that processes it, Spark generates 1000 tasks. If your cluster has 200 cores available, those 1000 tasks run in roughly 5 waves of 200 each: at any moment, 200 tasks are running in parallel and the rest are queued. (In practice the waves blur, because a core picks up the next queued task as soon as it finishes its current one.) If your cluster has 2000 cores, all 1000 tasks run simultaneously and 1000 cores sit idle. You can’t get more parallelism than you have partitions.
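The wave arithmetic above can be sketched in a few lines of plain Python. This is a back-of-the-envelope model, not Spark's scheduler: the function name schedule is made up for illustration, and it assumes every task needs exactly one core and all tasks take equally long.

```python
import math

def schedule(num_partitions: int, num_cores: int) -> dict:
    """Model how one stage's tasks map onto cores (one task per partition)."""
    tasks = num_partitions                       # 1 partition -> 1 task
    max_parallel = min(tasks, num_cores)         # tasks that can run at once
    waves = math.ceil(tasks / num_cores)         # rounds needed to drain the queue
    idle_cores = max(num_cores - tasks, 0)       # cores that never get a task
    return {
        "tasks": tasks,
        "max_parallel": max_parallel,
        "waves": waves,
        "idle_cores": idle_cores,
    }

print(schedule(1000, 200))    # 5 waves, no idle cores
print(schedule(1000, 2000))   # 1 wave, 1000 idle cores
```

Adding cores past the partition count changes nothing except the idle_cores figure, which is the point of the identity.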

Inside an executor, a partition is just a subset of the rows of the DataFrame. The task streams those rows through the stage’s plan (filters, transformations, aggregations, whatever the plan says), buffering them in memory only where an operator or a cache requires it and spilling to disk if memory is tight. It then writes its output, either to shuffle files for the next stage or to whatever sink the action targeted.

The partitions of a DataFrame don’t live anywhere centrally; they live distributed across the executors. That’s the entire point of distributed processing.

The on-disk partition

The other use of “partition” is on disk: a directory in a partitioned write.

/data/transactions/
  year=2024/
    month=01/
      part-00000.parquet
      part-00001.parquet
      ...
    month=02/
      ...
  year=2025/
    month=01/
      ...

When you write df.write.partitionBy("year", "month").parquet(...), Spark creates one directory per distinct combination of partition column values. Reading the data back, Spark uses those directory names to do partition pruning — if you filter WHERE year = 2024, Spark only opens files under year=2024/ and skips the rest entirely.
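To make the pruning idea concrete, here is a toy model in plain Python of how key=value directory names alone are enough to skip files. It is only a sketch: Spark's real pruning happens in the query planner and infers column types, and the helper names partition_values and prune are hypothetical.

```python
def partition_values(path: str) -> dict:
    """Parse key=value directory segments out of a file path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value          # values stay strings in this toy model
    return values

def prune(paths, **wanted):
    """Keep only the paths whose directory values match every filter."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == str(v) for k, v in wanted.items())
    ]

files = [
    "/data/transactions/year=2024/month=01/part-00000.parquet",
    "/data/transactions/year=2024/month=02/part-00000.parquet",
    "/data/transactions/year=2025/month=01/part-00000.parquet",
]

# Only the two files under year=2024/ survive; the third is never opened.
print(prune(files, year=2024))
```

No file contents are read to make the decision, which is why pruning on a partition column is so much cheaper than filtering on an ordinary column.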

These on-disk partitions are different from runtime partitions. A single on-disk partition (one year=2024/month=03/ directory) can contain many files; reading it produces many runtime partitions. Conversely, a runtime partition can write multiple on-disk partitions (if the data within it spans several year/month values).
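The many-to-many relationship is easier to see with a small simulation. The sketch below is plain Python, not Spark, and it rests on one stated assumption: each task writes one file for every distinct combination of partition-column values it holds. The function name files_written and the sample rows are invented for illustration.

```python
from collections import defaultdict

def files_written(runtime_partitions, partition_cols):
    """Count output files per directory for a partitionBy-style write.

    Assumption: each task writes one file per distinct combination of
    partition-column values found in its runtime partition.
    """
    files_per_dir = defaultdict(int)
    for rows in runtime_partitions:
        dirs_in_task = {tuple(row[c] for c in partition_cols) for row in rows}
        for directory in dirs_in_task:
            files_per_dir[directory] += 1
    return dict(files_per_dir)

runtime_partitions = [
    [{"year": 2024, "month": 1}, {"year": 2024, "month": 2}],  # task 0 spans two directories
    [{"year": 2024, "month": 1}],                              # task 1 touches one
    [{"year": 2024, "month": 1}, {"year": 2025, "month": 1}],  # task 2 spans two
]

result = files_written(runtime_partitions, ["year", "month"])
print(result)                 # files per (year, month) directory
print(sum(result.values()))   # total files across all directories
```

Three runtime partitions produce five files across three directories here. Scale that up to hundreds of tasks that each touch every directory and you get the small-file explosion that partitioned writes are known for.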

The two senses are related — both are about chunking data — but they’re separate concepts living at different layers. When someone says “this job has too many partitions,” ask which kind. Runtime partitions affect parallelism and shuffle cost. On-disk partitions affect read pruning and directory bloat. The fixes are different.

For the rest of this lesson — and most of Module 6 — “partition” means the runtime sense unless explicitly noted.

Tasks per stage = partitions of the input

The simplest, most useful identity in Spark performance work:

The number of tasks in a stage equals the number of partitions of that stage’s input.

A stage starts at a shuffle boundary (or at a read) and ends at the next shuffle (or at the action). Within a stage, every task processes one partition of the stage’s input through the stage’s plan, end to end. So:

  • A read of 800 input files (with one partition per file) creates a stage with 800 tasks.
  • A groupBy shuffle output of 200 partitions creates a downstream stage with 200 tasks.
  • A repartition(50) on a DataFrame causes the next stage to have 50 tasks.

That’s why people obsess over partition counts. The partition count sets the ceiling on parallelism. Too few partitions and the cluster is underused; too many and you waste time scheduling tiny tasks.

Inspecting partitions

You don’t have to guess. Two cheap ways to look at what a DataFrame’s partitions actually are:

How many partitions:

df.rdd.getNumPartitions()
# 200

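This is the runtime partition count, the one that determines task count. It works on any DataFrame and is cheap in most cases because it does not scan the final stage’s data, although on plans that contain shuffles it can trigger upstream work when Adaptive Query Execution is on.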

How big each partition is:

from pyspark.sql import functions as F

(df.withColumn("pid", F.spark_partition_id())
   .groupBy("pid")
   .count()
   .orderBy("pid")
   .show(50))

# +---+------+
# |pid| count|
# +---+------+
# |  0| 50012|
# |  1| 49873|
# |  2| 49991|
# ...
# |199| 50104|
# +---+------+

spark_partition_id() returns the integer ID of the partition each row currently lives in. Group by it, count, look at the distribution. If every partition has roughly the same row count, the data is balanced. If one partition has 10x the median, you have skew (lesson 28) and the next shuffle is going to hurt.

This is the same diagnostic from the skew lessons, but now you understand exactly what it’s telling you: the row count per parallel work unit. A partition with 10 million rows is one task processing 10 million rows. A partition with 100 rows is one task processing 100 rows. A stage finishes only when its slowest task does, so the largest partition sets the stage’s duration no matter how many small ones surround it.
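Once the per-partition counts are on the driver, the "10x the median" check is a few lines of plain Python. This is a minimal sketch: skew_report is a hypothetical helper, the 10x threshold is just the rule of thumb from this lesson, and the sample numbers are made up.

```python
from statistics import median

def skew_report(counts: dict, threshold: float = 10.0) -> dict:
    """Summarize per-partition row counts, keyed by partition id."""
    mid = median(counts.values())
    worst_pid = max(counts, key=counts.get)          # partition with the most rows
    ratio = counts[worst_pid] / mid if mid else float("inf")
    return {
        "median": mid,
        "worst_pid": worst_pid,
        "ratio": ratio,
        "skewed": ratio >= threshold,
    }

balanced = {0: 50012, 1: 49873, 2: 49991, 3: 50104}
lopsided = {0: 50012, 1: 49873, 2: 10_000_000, 3: 50104}

print(skew_report(balanced)["skewed"])   # False
print(skew_report(lopsided)["skewed"])   # True
```

To feed it real data, one option is to collect the grouped counts from the query above into a dict, for example {r["pid"]: r["count"] for r in counts_df.collect()}, where counts_df stands for the result of the groupBy("pid").count() step.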

A useful follow-up: byte size, not just row count. Row counts can be misleading — wide rows with arrays and structs can be 100x heavier than narrow rows of integers. A rough way to check, when you suspect partition imbalance is more than just row counts:

def partition_size_bytes(df):
    """Per-partition byte size estimate via the RDD interface."""
    return (df.rdd
              .mapPartitionsWithIndex(
                  lambda idx, it: [(idx, sum(len(str(r)) for r in it))])
              .collect())

for pid, nbytes in sorted(partition_size_bytes(df))[:5]:
    print(f"partition {pid}: {nbytes:,} bytes (string-len approximation)")

It’s not perfect: str(row) over-counts compared to the serialized size, and the Python pass over every row is slow, so run it on a sample if the DataFrame is large. The ratios between partitions still tell you what you need to know. If partition 0 is 50x the size of partition 1, you’ve got skew that row counts alone might not have shown.

Default partition counts

Where do partition counts come from in the first place? Three main sources:

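On read. When you read a file-based source (Parquet, CSV, JSON, ORC) through the DataFrame API, Spark packs the input files into partitions by size. One partition per file is a decent first guess for medium-sized files, with a few caveats:

  • Splittable files larger than spark.sql.files.maxPartitionBytes (default 128 MB) are cut into chunks of at most that size, one partition per chunk. This applies on HDFS and on object stores such as S3 alike. HDFS block size drives the splits in the older RDD and Hadoop input-format APIs (sc.textFile and friends), not in the DataFrame reader.
  • Files that can’t be split, such as gzip-compressed CSV or JSON, land whole in a single partition regardless of size.
  • Small files are packed together into one partition. spark.sql.files.openCostInBytes (default 4 MB) is the estimated cost of opening a file; it is added to each file’s size during packing, which limits how many tiny files end up in the same partition.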

The practical takeaway: if you read a directory of 800 Parquet files, expect roughly 800 partitions. If those files are huge, expect more. If they’re tiny, expect fewer.
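For a rough feel of those numbers, the sketch below estimates the partition count of a DataFrame read from a list of file sizes. It is modeled on my understanding of how Spark 3.x sizes and packs file splits, so treat it as an approximation rather than the reader's actual code. The function name estimate_read_partitions is hypothetical, default_parallelism stands in for the cluster's total cores, and every file is assumed to be splittable.

```python
MB = 1024 * 1024

def estimate_read_partitions(file_sizes, default_parallelism,
                             max_partition_bytes=128 * MB,
                             open_cost_bytes=4 * MB):
    """Approximate the partition count for a read of splittable files."""
    # Target split size: capped by maxPartitionBytes, but smaller when the
    # whole input is small relative to the available parallelism.
    total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    max_split = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))

    # Cut each file into chunks of at most max_split bytes.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(max_split, size - offset))
            offset += max_split

    # Pack chunks, largest first. Each chunk also "costs" open_cost_bytes,
    # which limits how many tiny files share one partition.
    partitions, current_bytes, chunks_in_current = 0, 0, 0
    for chunk in sorted(chunks, reverse=True):
        if chunks_in_current and current_bytes + chunk > max_split:
            partitions += 1                      # close the current partition
            current_bytes, chunks_in_current = 0, 0
        current_bytes += chunk + open_cost_bytes
        chunks_in_current += 1
    return partitions + (1 if chunks_in_current else 0)

medium = estimate_read_partitions([100 * MB] * 800, default_parallelism=200)
huge = estimate_read_partitions([1024 * MB] * 10, default_parallelism=8)
tiny = estimate_read_partitions([1 * MB] * 800, default_parallelism=8)
print(medium, huge, tiny)
```

Under these assumptions, 800 medium files give 800 partitions, ten 1 GB files give 80, and 800 one-megabyte files collapse to a few dozen: roughly, more, and fewer, matching the takeaway above.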

After a shuffle. Every shuffle output is repartitioned according to spark.sql.shuffle.partitions (default 200). Group-by, join, distinct, window: anything that requires a shuffle produces 200 partitions on the other side, regardless of input size, unless Adaptive Query Execution (on by default since Spark 3.2) coalesces them afterwards. This default is almost always wrong, which is exactly what lesson 32 is about.
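The "regardless of input size" part is easy to demonstrate with a toy model of hash partitioning. This is plain Python, not Spark: zlib.crc32 is a stand-in for Spark's own hash function, the helper names are hypothetical, and the model ignores any coalescing by Adaptive Query Execution.

```python
import zlib
from collections import Counter

def shuffle_partition(key: str, num_partitions: int = 200) -> int:
    """Pick a target partition for a key (crc32 stands in for Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def shuffle_layout(keys, num_partitions: int = 200) -> Counter:
    """Rows per target partition after shuffling on the given keys."""
    return Counter(shuffle_partition(k, num_partitions) for k in keys)

# 100,000 rows, but only 5 distinct grouping keys.
keys = [f"country_{i % 5}" for i in range(100_000)]
layout = shuffle_layout(keys)

print(f"{len(layout)} of 200 partitions hold data")
```

All rows with the same key land in the same partition, so five distinct keys can fill at most five of the 200 partitions. The stage still gets 200 tasks; the rest simply have nothing to do.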

Explicit repartition. df.repartition(N) and df.coalesce(N) set the partition count explicitly. We’ll cover those in detail in lessons 33 and 34, but for now: repartition reshuffles to exactly N partitions, while coalesce merges existing partitions without a full shuffle and can only reduce the count, never raise it.

There’s also an indirect source: any operation that preserves partitioning (a filter, a select, a withColumn that doesn’t shuffle) keeps the partition count it inherited. So if you read 800 files, filter out 95% of the rows, and then group-by, the filter step still has 800 partitions — most of them tiny — until the group-by shuffle changes things. That sounds wasteful, and it sometimes is, which is why Module 6 has a whole lesson on coalesce for exactly this case.

A mental picture

If you want a single image to keep in your head, think of partitions as slices of a pizza.

  • The pizza is your DataFrame.
  • The slices are the partitions.
  • The people sitting around the table are the cores (the task slots your executors provide).
  • Each slice is eaten by exactly one person, one bite at a time. Eating a slice is the task.
  • A pizza with 200 slices and 8 people: each person eats 25 slices, one at a time. The meal is done when everyone finishes.
  • A pizza with one giant slice and 199 tiny ones: 199 people finish their tiny slices in seconds; one person is still chewing through the giant slice 30 minutes later. That’s skew.
  • A pizza cut into 4 slices for 200 people: 196 of them have nothing to do. That’s underpartitioning.
  • A pizza cut into 10000 microscopic slices: everyone spends more time picking up slices than eating. That’s overpartitioning, and the overhead is real — every task has scheduling cost, every task writes its own shuffle file, every task is a small bookkeeping burden on the driver.

The whole game is: cut the pizza into roughly equal slices, of a size that keeps every eater busy without anyone choking and without too much fiddly per-slice overhead. The “right” slice size depends on your workload, your cluster, and your data; the next lesson is about how to pick it.
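The trade-off can be put into a toy cost model. Everything in it is an assumption made for illustration: slices are perfectly equal, tasks run in strict waves, and the 0.1 second per-task overhead is an invented figure, not a measured Spark number. The function name stage_seconds is hypothetical.

```python
import math

def stage_seconds(total_work_s, num_partitions, num_cores,
                  per_task_overhead_s=0.1):
    """Estimate stage duration: waves x (work per task + fixed overhead)."""
    waves = math.ceil(num_partitions / num_cores)
    task_s = total_work_s / num_partitions + per_task_overhead_s
    return waves * task_s

# 2000 core-seconds of work on a 200-core cluster, cut four different ways.
for n in (4, 200, 10_000, 100_000):
    print(f"{n:>7} partitions: ~{stage_seconds(2000, n, 200):.1f} s")
```

With these made-up numbers, 4 partitions leave most of the cluster idle, 200 is the fastest of the four, and 100,000 lose most of the time to per-task overhead. Real workloads add skew and memory pressure on top, which is why the model is only a starting point.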

Why this matters now

Module 6 is going to get into the knobs and patterns of partitioning: the shuffle partition default, repartition vs. coalesce, partitioning on write, bucketing. None of that makes sense without the picture from this lesson. Before we tune anything:

  • A partition is the chunk of data Spark processes as one unit.
  • One partition = one task = one core, at any moment.
  • Tasks per stage = partitions of that stage’s input.
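  • Default partitions on read ~ number of input files (more for big files, which are cut into chunks of up to spark.sql.files.maxPartitionBytes; fewer for tiny files, which are packed together).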
  • Default partitions after shuffle = spark.sql.shuffle.partitions (200, almost always wrong).
  • On-disk partitions are a separate concept — directories in a partitioned write — even though the word is the same.

Internalize that and the rest of Module 6 is just consequences. Lesson 32 walks straight into the most consequential default in Spark and shows how to set it correctly for your workload.


References: Apache Spark documentation on RDD partitioning and the SQL execution model; Databricks tuning guide on partitions and parallelism. Retrieved 2026-05-01.
