Python, from the ground up Lesson 31 / 60

Reshape: pivot, melt, stack, unstack

Wide-to-long, long-to-wide, and when MultiIndex columns are a feature instead of a bug.

Tabular data has a shape. The same numbers can sit in a DataFrame in two very different layouts, and which one you want depends entirely on what you’re about to do next. Today we cover reshaping — the four functions (melt, pivot, stack, unstack) that move data between layouts, the trap that catches people on pivot, and why MultiIndex columns are sometimes useful and usually annoying.

Wide vs long

A wide table has one row per entity and many columns for the attributes:

order_id    Q1     Q2     Q3     Q4
1          120    150    180    210
2           90    100    110    130

A long table has one row per entity-attribute pair:

order_id    quarter    sales
1           Q1         120
1           Q2         150
1           Q3         180
1           Q4         210
2           Q1          90
...

The same information, rearranged. Each shape has its place. Wide is what humans read in a report — quarters across the top, totals at the bottom, eyes scan left-to-right. Long is what databases store, what machine-learning pipelines expect, what groupby and most plotting libraries assume. SQL is fundamentally a long-table language. The faster you become at converting between the two, the less time you spend wrestling pandas.

The general rule: wide for reporting, long for processing. If you’re emailing a CFO a quarterly summary, go wide. If you’re feeding it into a model, computing aggregations, or shipping it to a database, go long.

Wide to long: melt

melt takes a wide DataFrame and stacks the chosen columns into two new ones — one for the old column names, one for the values.

import pandas as pd

wide = pd.DataFrame({
    "order_id": [1, 2, 3],
    "Q1": [120, 90, 200],
    "Q2": [150, 100, 220],
    "Q3": [180, 110, 240],
    "Q4": [210, 130, 260],
})

long = wide.melt(
    id_vars=["order_id"],
    value_vars=["Q1", "Q2", "Q3", "Q4"],
    var_name="quarter",
    value_name="sales",
)

The four arguments worth knowing:

  • id_vars — columns that stay as identifying columns; they get repeated once per melted row.
  • value_vars — columns to melt. If you omit it, every column not in id_vars gets melted.
  • var_name — the name of the new column that holds the old column names. Default is "variable", which is never what you want; always set this.
  • value_name — the name of the new column that holds the values. Default is "value", also rarely right.

After this, long has 12 rows (3 orders × 4 quarters) and the layout most analytical tools expect. You can group by quarter, plot quarter-on-quarter trends, join against a calendar table — all things that are awkward in the wide layout.

A common mistake: forgetting that id_vars matters. If your wide DataFrame has metadata columns (region, customer name, currency) that should travel with each melted row, list them all in id_vars or you’ll lose them.

Long to wide: pivot

pivot is melt’s inverse. Given a long DataFrame, you pick one column to become the new row index, one column to become the new column headers, and one column to fill the cells.

back_to_wide = long.pivot(
    index="order_id",
    columns="quarter",
    values="sales",
)

Result:

quarter      Q1     Q2     Q3     Q4
order_id
1           120    150    180    210
2            90    100    110    130
3           200    220    240    260

Clean enough on a clean input. But there’s a trap.

The duplicates trap

pivot requires the (index, columns) pairs to be unique. The moment you have two rows with the same order_id and quarter — say because the original table is per-line-item rather than per-order — pivot fails with ValueError: Index contains duplicate entries, cannot reshape.

The fix is pivot_table, which takes an aggfunc and aggregates duplicates instead of erroring:

df.pivot_table(
    index="order_id",
    columns="quarter",
    values="sales",
    aggfunc="sum",  # or "mean", "max", a custom callable, a list of them
    fill_value=0,   # what to put in cells that have no data
)

I default to pivot_table in production code. It’s strictly more powerful than pivot and makes the aggregation explicit. The only reason to reach for pivot is when you genuinely want it to error if duplicates appear — a sanity check that your upstream join didn’t go wrong.

pivot_table also accepts multiple values= columns and multiple aggfunc= entries, which is how you get a hierarchical reporting layout (sales sum, sales count, average sale value all in one table). That’s where MultiIndex columns enter the picture.

MultiIndex columns

When you pivot_table with two value columns or two aggfuncs, the columns of the result become a MultiIndex — columns with two levels stacked on top of each other:

report = sales.pivot_table(
    index="region",
    columns="quarter",
    values=["revenue", "units"],
    aggfunc="sum",
)
# columns are now: ('revenue', 'Q1'), ('revenue', 'Q2'), ..., ('units', 'Q1'), ...

This is great for export-to-Excel reports, where the two-row header reads naturally. It’s a pain for almost everything else: SQL doesn’t have hierarchical columns, ML libraries don’t either, and accessing a single column means writing report[("revenue", "Q1")] instead of report["revenue_Q1"].

To flatten a MultiIndex column into single-level names:

report.columns = ["_".join(map(str, c)).strip("_") for c in report.columns]
# now: revenue_Q1, revenue_Q2, ..., units_Q1, units_Q2, ...

I do this almost reflexively after any pivot_table that produces a MultiIndex. It saves so much downstream pain that the cost of one extra line is negligible.

stack and unstack

stack and unstack move data between rows and columns at the index level — they’re the MultiIndex equivalents of melt and pivot.

stack() takes the innermost level of the columns and moves it down into a new innermost level of the row index:

report.stack()  # quarters move from column level into a row MultiIndex

unstack() is the reverse — it takes the innermost level of the row index and moves it up into a new column level:

long_indexed.unstack()  # innermost row index becomes columns

In practice, unstack is more useful: many pandas operations (especially groupby with multiple keys) leave you with a MultiIndex Series, and unstack is the one-liner that turns it into a clean wide DataFrame:

sales.groupby(["region", "quarter"])["amount"].sum().unstack()

You get a DataFrame indexed by region with one column per quarter. This pattern — groupby two columns, sum the value, unstack the second key — is so common it’s worth memorising.

You can specify which level to stack/unstack with the level= argument: df.unstack(level="quarter") or df.unstack(level=0). Default is the innermost.

reset_index: flatten the MultiIndex back to columns

After all this MultiIndex manipulation, the typical final step is to put everything back into plain columns for export. reset_index() does exactly that:

flat = report.reset_index()
# the index becomes a normal column, the result is a flat DataFrame

If the index has multiple levels, all of them become columns. If you only want some of them flattened, pass level=. The mirror operation, set_index(), goes the other way and is useful before joins or unstacks.

A real example: hierarchical sales report to a SQL export

Suppose you have raw sales data, one row per line item, and you want to produce two outputs: a hierarchical report for the CFO (quarters across, regions and product categories nested as rows), and a flat long table for the data warehouse.

import pandas as pd

sales = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "EU", "US"],
    "category": ["A", "B", "A", "B", "A", "A"],
    "quarter": ["Q1", "Q1", "Q1", "Q1", "Q2", "Q2"],
    "amount": [100, 50, 200, 150, 120, 220],
})

# CFO report: regions and categories down, quarters across
report = sales.pivot_table(
    index=["region", "category"],
    columns="quarter",
    values="amount",
    aggfunc="sum",
    fill_value=0,
)
report.to_excel("cfo_report.xlsx")

# Warehouse export: long, flat, one row per region-category-quarter
warehouse = (
    sales
    .groupby(["region", "category", "quarter"], as_index=False)["amount"]
    .sum()
)
warehouse.to_parquet("sales_long.parquet")

Same source, two shapes, two consumers, each happy. The CFO gets the wide hierarchical layout her eyes can scan; the warehouse gets the long layout SQL can join against. Reshaping is the bridge.

When to use which

A rough decision tree:

  • Going from “spreadsheet humans made” to “data I can analyse”: melt.
  • Going from “data I analysed” to “table for a human”: pivot_table.
  • Already have a MultiIndex Series from a multi-key groupby: unstack.
  • Need to flatten a MultiIndex back to plain columns for export: reset_index + the column-flattening list comprehension.

Most reshape problems are one of these four cases. When you find yourself writing complicated reshape logic, stop and check — there’s almost always a one-liner you missed.

What’s next

Lesson 32 is time series — DatetimeIndex, resample, rolling windows, and the timezone trap that bites every production pipeline at least once. Pandas was originally built for financial time series, and it shows: the time-series API is one of the most polished parts of the library.

Further reading

See you Friday.

Search