There are two shapes a tabular dataset can take, and almost every analytical workflow flips between them at least once. Long format has one row per observation: (user_id, metric_name, metric_value). Wide format has one row per entity, with each metric as its own column: (user_id, sessions, orders, revenue). Long is great for storage, aggregation, and ML pipelines; wide is great for reporting, exports, and human eyes.
The verbs that move between them are pivot (long to wide) and unpivot (wide to long, recently called melt in PySpark). Both are essential, both have specific PySpark gotchas worth knowing, and both can produce spectacular performance regressions if you reach for them carelessly. Let’s walk through each direction with the right idioms and the wrong ones to avoid.
Pivot: long to wide
The classic pivot starts from a long table and produces a wide one. Sales by country and quarter:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = (SparkSession.builder
.appName("PivotDemo")
.master("local[*]")
.getOrCreate())
long_df = spark.createDataFrame(
[
("IT", "Q1", 100.0),
("IT", "Q2", 150.0),
("IT", "Q3", 120.0),
("IT", "Q4", 200.0),
("NL", "Q1", 80.0),
("NL", "Q2", 90.0),
("NL", "Q3", 110.0),
("NL", "Q4", 130.0),
("DE", "Q1", 50.0),
("DE", "Q2", 70.0),
],
"country STRING, quarter STRING, revenue DOUBLE",
)
wide_df = (long_df
.groupBy("country")
.pivot("quarter")
.agg(F.sum("revenue")))
wide_df.show()
# +-------+----+----+----+----+
# |country| Q1| Q2| Q3| Q4|
# +-------+----+----+----+----+
# | IT|100 |150 |120 |200 |
# | NL| 80 | 90 |110 |130 |
# | DE| 50 | 70 |null|null|
# +-------+----+----+----+----+
Three pieces:
groupBy("country")— the column(s) that stay rows in the output.pivot("quarter")— the column whose distinct values become column names.agg(F.sum("revenue"))— what to put in each cell where the row key and the column value coincide. You can pass any aggregate; pivot is fundamentally agroupBy+ spread.
DE has nulls for Q3 and Q4 because those rows didn’t exist in the input. That’s the right behaviour — a missing observation, not zero. If you want zeros, .fillna(0, subset=["Q3", "Q4"]) after the pivot.
Multiple aggregations are allowed:
(long_df
.groupBy("country")
.pivot("quarter")
.agg(F.sum("revenue").alias("rev"), F.count("*").alias("n"))
.show())
# Columns become: Q1_rev, Q1_n, Q2_rev, Q2_n, ...
The pivot performance trap
Here’s the thing nobody mentions in the first tutorial: pivot("quarter") without a list of values runs an extra job. Spark needs to know what columns to produce, so it kicks off a full scan of the source data to collect the distinct values of the pivot column, before it can plan the actual pivot.
For a small DataFrame this is invisible. For a 200 GB dataset partitioned across a thousand executors, that “discovery scan” is a real expense — sometimes more expensive than the pivot itself. And worse, it runs eagerly even though everything else in your pipeline is lazy.
The fix is to give Spark the list:
quarters = ["Q1", "Q2", "Q3", "Q4"]
(long_df
.groupBy("country")
.pivot("quarter", quarters)
.agg(F.sum("revenue"))
.show())
Now Spark knows exactly which columns to produce. No discovery scan, no extra job. If the data contains a value not in your list, it’s silently dropped from the output — which is usually what you want for reporting (you don’t want a stray test row to add a null_quarter column to your monthly report).
Get into the habit of always passing the list when you know it. Quarter names, currency codes, country codes, status enums — these are usually known up front. Save Spark the work.
Pivots that explode
A pivot is a wide transformation, and the wider the result, the more memory each executor needs to hold a single row. Pivoting on a column with 50 distinct values is fine. Pivoting on a column with 50 thousand distinct values produces a DataFrame where each row has 50 thousand columns, and that’s not really a DataFrame — that’s a sparse matrix with delusions of grandeur.
If your pivot column has high cardinality, you almost certainly want a different shape: long format with one row per (entity, metric) pair, possibly partitioned by metric. Reach for pivot when the result column count is in the tens or low hundreds, not thousands.
A useful guard: check the cardinality first.
n = long_df.select("quarter").distinct().count()
if n > 200:
raise ValueError(f"Pivot column has {n} distinct values, refusing to pivot")
This catches the case where someone points pivot at the wrong column at 3 AM on a Saturday.
Unpivot: wide to long
The reverse direction — wide to long — is more common than people realize. It comes up whenever you’ve received a CSV from a vendor, an Excel export from finance, or a denormalized warehouse table where each metric got its own column. To aggregate, join, or feed a model, you usually want it long.
In Spark 3.4 and newer, there’s a first-class operator: melt.
wide_df = spark.createDataFrame(
[
("IT", 100.0, 150.0, 120.0, 200.0),
("NL", 80.0, 90.0, 110.0, 130.0),
("DE", 50.0, 70.0, None, None),
],
"country STRING, Q1 DOUBLE, Q2 DOUBLE, Q3 DOUBLE, Q4 DOUBLE",
)
# Spark 3.4+
long_again = wide_df.melt(
ids=["country"],
values=["Q1", "Q2", "Q3", "Q4"],
variableColumnName="quarter",
valueColumnName="revenue",
)
long_again.show()
# +-------+-------+-------+
# |country|quarter|revenue|
# +-------+-------+-------+
# | IT| Q1| 100.0|
# | IT| Q2| 150.0|
# | IT| Q3| 120.0|
# | IT| Q4| 200.0|
# | NL| Q1| 80.0|
# ...
ids are the columns to keep as-is (one per row, repeated for each unpivoted value). values are the columns to unpivot. variableColumnName and valueColumnName name the two new columns the unpivot produces. Done.
The pre-3.4 trick
If you’re on Spark 3.3 or older — still common in managed platforms that lag a release or two — melt doesn’t exist. The idiom for the past several years has been stack inside a selectExpr:
# Spark < 3.4
long_again = wide_df.selectExpr(
"country",
"stack(4, 'Q1', Q1, 'Q2', Q2, 'Q3', Q3, 'Q4', Q4) AS (quarter, revenue)"
)
long_again.show()
stack(N, ...) takes the count of groups, then N pairs of (label, value). Each pair becomes a row. The AS (quarter, revenue) names the two output columns. It’s ugly compared to melt but it works on every Spark version, and it’s still useful when you need full control over the labels (different labels than column names, for instance).
For a programmatic version where the column list comes from config:
metric_cols = ["Q1", "Q2", "Q3", "Q4"]
stack_expr = f"stack({len(metric_cols)}, " + \
", ".join(f"'{c}', `{c}`" for c in metric_cols) + \
") AS (quarter, revenue)"
long_again = wide_df.selectExpr("country", stack_expr)
Backtick-quote the column references in case the names contain special characters. Single-quote the labels. Get the comma between them right.
This is also a reasonable case for falling back to SQL syntax — the selectExpr is essentially SQL anyway. Lesson 37’s “use SQL when the code mostly is SQL” rule applies.
Round-trip: pivot then unpivot
To prove they’re inverses, round-trip the data and confirm:
wide = long_df.groupBy("country").pivot("quarter", ["Q1","Q2","Q3","Q4"]).agg(F.sum("revenue"))
# Spark 3.4+
roundtrip = wide.melt(
ids=["country"],
values=["Q1", "Q2", "Q3", "Q4"],
variableColumnName="quarter",
valueColumnName="revenue",
).filter(F.col("revenue").isNotNull())
# Compare counts
print(long_df.count(), roundtrip.count())
# Same modulo the rows that didn't exist in long (DE Q3, Q4)
The filter(F.col("revenue").isNotNull()) strips the cells that the pivot filled with null because of missing input rows. Without it, the round-trip is “almost right” — every entity has a row for every metric, including the ones that never existed in the source. Sometimes that’s what you want; sometimes it isn’t. Be deliberate.
Choosing a shape for storage
Long format is almost always the right shape for storage:
- New metrics become new rows, not new columns. Schema evolution stops being a project.
- Aggregations across metrics work naturally —
groupBy("metric_name").agg(F.avg("metric_value"))is one query. - File compression is more uniform — sparse wide tables waste a lot of space on null markers in Parquet.
- ML feature stores and many time-series tools expect long input.
Wide format is the right shape for outputs: a report, a dashboard query, a CSV for a non-technical recipient. Pivot at the boundary of your pipeline, not in the middle.
If you find yourself pivoting and re-aggregating and unpivoting in the same job, that’s a smell. Stay long until the very last step, then pivot once on the way out.
A practical heuristic: if a downstream consumer is human, lean wide. If it’s another Spark job, ML training code, or a metrics database, lean long. The two consumers have different ergonomic preferences and the data should match the consumer.
There’s also a third format worth mentioning: a struct or map column. Instead of pivoting four quarters into four columns, you can collect them into a single MapType column with F.map_from_entries(F.collect_list(F.struct("quarter", "revenue"))). Reads stay narrow, the schema doesn’t explode when a new quarter shows up, and any analyst who needs a wide layout can select("country", "metrics.Q1", "metrics.Q2") to extract specific keys. It’s a middle ground that avoids the worst of both shapes; not always the right call, but worth keeping in your toolbox.
Run this on your own machine
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = (SparkSession.builder
.appName("ReshapeDemo")
.master("local[*]")
.getOrCreate())
long_df = spark.createDataFrame(
[
("IT", "Q1", 100.0), ("IT", "Q2", 150.0),
("IT", "Q3", 120.0), ("IT", "Q4", 200.0),
("NL", "Q1", 80.0), ("NL", "Q2", 90.0),
("NL", "Q3", 110.0), ("NL", "Q4", 130.0),
("DE", "Q1", 50.0), ("DE", "Q2", 70.0),
],
"country STRING, quarter STRING, revenue DOUBLE",
)
# Pivot with explicit values — no discovery scan
wide = (long_df
.groupBy("country")
.pivot("quarter", ["Q1", "Q2", "Q3", "Q4"])
.agg(F.sum("revenue"))
.fillna(0))
wide.show()
# Unpivot, modern (Spark 3.4+)
try:
long_again = wide.melt(
ids=["country"],
values=["Q1", "Q2", "Q3", "Q4"],
variableColumnName="quarter",
valueColumnName="revenue",
)
long_again.orderBy("country", "quarter").show()
except AttributeError:
print("melt not available in this Spark version, falling back to stack")
# Unpivot, pre-3.4 fallback
metric_cols = ["Q1", "Q2", "Q3", "Q4"]
stack_expr = (
f"stack({len(metric_cols)}, " +
", ".join(f"'{c}', `{c}`" for c in metric_cols) +
") AS (quarter, revenue)"
)
long_stack = wide.selectExpr("country", stack_expr)
long_stack.orderBy("country", "quarter").show()
# Cardinality guard before pivot
n_distinct = long_df.select("quarter").distinct().count()
print(f"pivot column cardinality: {n_distinct}")
Run all three. The three long DataFrames at the end are equivalent (modulo the zeros where you used fillna). The cardinality print is your “is this safe to pivot” sanity check — get into the habit.
That wraps the data-shape toolkit. Module 7’s first three lessons gave you the full SQL surface (last lesson), windowing inside that surface (the lesson before this), and reshaping between the two layouts. Next lesson takes us into territory where Catalyst can’t help you anymore — user-defined functions, the escape hatch when SQL and DataFrame primitives don’t have what you need, and the performance cost that comes with crossing the JVM-Python boundary on every row.
References: Apache Spark DataFrame pivot and melt documentation (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.melt.html) and stack function reference. Retrieved 2026-05-01.