If selection is “pick rows,” aggregation is “summarise rows.” Once you’ve filtered down to the data you care about, almost any analysis ends with a question shaped like “average per country,” “max per customer,” “first event per session.” Pandas has one core operation for this — groupby — and four methods you can call on the result: agg, transform, apply, filter. They look similar and they do very different things. Today we get the distinction right, because choosing the wrong one is the difference between a query that runs in 30 milliseconds and one that runs in 30 seconds.
We’ll work with this throughout:
import pandas as pd
pd.set_option("mode.copy_on_write", True)
df = pd.DataFrame(
{
"country": ["IT", "DE", "FR", "IT", "DE", "FR", "IT", "DE"],
"year": [2024, 2024, 2024, 2025, 2025, 2025, 2025, 2024],
"channel": ["web", "web", "web", "app", "app", "web", "web", "app"],
"revenue": [120.0, 340.0, 210.0, 145.0, 380.0, 225.0, 95.0, 290.0],
"orders": [12, 25, 18, 15, 28, 19, 9, 22],
}
)
The basic shape
df.groupby("col") returns a DataFrameGroupBy object — not a result, an intermediate. Pandas doesn’t actually iterate the data until you call an aggregation method on it:
df.groupby("country").sum(numeric_only=True)
df.groupby("country").mean(numeric_only=True)
df.groupby("country").size() # row count per group
numeric_only=True is good practice — without it, pandas will try to sum strings (which “works” by concatenation) or error on mixed types. In pandas 2.x the default is to error on non-numeric columns for sum/mean, which is the right default but does mean you set the flag explicitly when you want it.
The output is a DataFrame indexed by the group key. The group key has moved from being a column to being the index — handy for plotting and for further joins, but a common source of “wait, where did my country column go?” If you want it as a column, either call .reset_index() or pass as_index=False:
df.groupby("country", as_index=False).sum(numeric_only=True)
Multi-key groupby — group by a tuple of columns:
df.groupby(["country", "year"]).sum(numeric_only=True)
The result has a MultiIndex on rows. You can move both keys back to columns with .reset_index() or use as_index=False.
.agg — different aggregations per column
sum, mean, min, max, count, nunique, median, std, var, first, last — all available as methods on the groupby. When you want different aggregations for different columns, that’s .agg:
df.groupby("country").agg(
{
"revenue": "sum",
"orders": "mean",
}
)
You can pass a list per column to get multiple aggregations:
df.groupby("country").agg(
{
"revenue": ["sum", "mean", "max"],
"orders": ["sum", "min"],
}
)
The result has a MultiIndex on the columns: ("revenue", "sum"), ("revenue", "mean"), etc. That’s awkward to work with downstream — df["revenue"]["sum"] instead of df["revenue_sum"]. Which brings us to the modern way.
Named aggregations — the way to do it
Pandas 0.25 added named aggregations, and once you see them you’ll never go back. The form:
df.groupby("country").agg(
total_revenue=("revenue", "sum"),
avg_revenue=("revenue", "mean"),
total_orders=("orders", "sum"),
n_rows=("revenue", "size"),
)
Each keyword argument is a (source_column, aggregation_function) tuple, and the keyword name becomes the resulting column name. No MultiIndex. Flat columns. Self-documenting. This is the form you should be writing in 2026.
Custom aggregations? Pass a callable:
df.groupby("country").agg(
revenue_range=("revenue", lambda s: s.max() - s.min()),
p95_revenue=("revenue", lambda s: s.quantile(0.95)),
)
lambda s: ... receives a Series — the values of that column for one group. The return value can be a scalar (becomes one cell) or a tuple/list (rare, advanced).
For “count rows per group,” prefer ("any_column", "size") to ("any_column", "count") — see the next section for why.
size vs count, briefly
Two functions that sound the same and aren’t:
.size()returns the number of rows in each group, including rows where the value isNaN. It’s a property of the group, not of any one column..count()returns the number of non-null values in each column of each group. It varies by column.
df.groupby("country").size() # rows per country
df.groupby("country").count() # non-null per column per country
df.groupby("country").agg(n=("revenue", "size")) # named version
If you want “how many rows are in this group,” use size. If you want “how many of the rows have a non-null revenue,” use count. Mistaking one for the other quietly produces wrong dashboards.
.transform — broadcast back to the original shape
agg reduces each group to one row. transform does an aggregation but broadcasts the result back to every row in the group, so the output has the same length as the input. This is the killer feature you didn’t know you needed.
The classic use: compute a per-row deviation from the group mean.
df["country_avg_revenue"] = df.groupby("country")["revenue"].transform("mean")
df["revenue_vs_avg"] = df["revenue"] - df["country_avg_revenue"]
Or, normalise within group:
df["revenue_share_of_country"] = (
df["revenue"] / df.groupby("country")["revenue"].transform("sum")
)
This is one expression, vectorised, fast. The alternative — build a separate aggregated DataFrame, then merge it back on the group key — is three lines, two extra objects, and an order of magnitude slower. Whenever you find yourself wanting a per-row value that depends on its group, reach for transform.
transform accepts the same string aggregations as agg ("mean", "sum", "max", "min", "count", "std", "rank"), plus any callable that takes a Series and returns a Series of the same length:
df["revenue_z"] = df.groupby("country")["revenue"].transform(
lambda s: (s - s.mean()) / s.std()
)
.apply — the escape hatch (use sparingly)
apply is the most flexible groupby method and the slowest. It calls a function once per group, with the group’s full DataFrame, and stitches the results together:
def top_n(group: pd.DataFrame, n: int = 2) -> pd.DataFrame:
return group.nlargest(n, "revenue")
df.groupby("country").apply(top_n, n=2, include_groups=False)
Pandas 2.2+ requires include_groups=False on apply when the groupby keys are also being passed in — otherwise it’ll deprecate-warn you. Set it.
When is apply the right answer? When the operation genuinely needs the whole group’s DataFrame and can’t be expressed as a per-column reduction or per-row broadcast. “Top N rows per group,” “fit a regression per group,” “complex multi-column logic that doesn’t map to agg.” For everything else — column-by-column aggregation, per-row group statistics — use agg or transform. They’re vectorised; apply is a Python loop in disguise.
A rough rule: if you can express it without apply, you should.
.filter — keep only some groups
filter takes a function that returns True/False per group and keeps the rows of groups where the function returns True:
df.groupby("country").filter(lambda g: len(g) >= 3)
That returns the original-shape DataFrame, but only with rows from countries that have at least three records. Useful for “drop rare categories before training a model” or “ignore one-off customers.” Note that filter here is the groupby method, not the DataFrame .filter method (which is something completely different, a column-name selector). Pandas naming, sigh.
Pivot tables — groupby plus reshape
pivot_table is groupby with a built-in reshape. Where groupby([a, b]).agg(...) produces a long-format DataFrame with a MultiIndex, pivot_table produces a wide-format one — one of the keys becomes columns:
df.pivot_table(
values="revenue",
index="country",
columns="year",
aggfunc="sum",
fill_value=0,
)
You get a DataFrame with countries as rows, years as columns, summed revenue in the cells. fill_value=0 says “if a (country, year) combination has no data, write zero instead of NaN” — usually what you want for a presentation table.
Multiple aggregations and value columns work too:
df.pivot_table(
values=["revenue", "orders"],
index="country",
columns="year",
aggfunc={"revenue": "sum", "orders": "mean"},
)
The result is wide and presentable. For machine-readable downstream consumption, prefer the long form (groupby + named agg) and reshape only at the end.
Combining: filter, group, aggregate, sort
Real analysis chains these. Pandas reads top-to-bottom when you wrap a method chain in parentheses:
result = (
df
.loc[df["revenue"].notna()]
.groupby(["country", "year"], as_index=False)
.agg(
total_revenue=("revenue", "sum"),
total_orders=("orders", "sum"),
avg_revenue=("revenue", "mean"),
n_rows=("revenue", "size"),
)
.assign(revenue_per_order=lambda d: d["total_revenue"] / d["total_orders"])
.sort_values(["country", "year"])
.reset_index(drop=True)
)
.assign creates a new column from a lambda that receives the in-progress DataFrame — useful in chains because you can’t write df["x"] = ... mid-chain. .sort_values sorts rows; .reset_index(drop=True) rebuilds a clean RangeIndex at the end so the result is presentable.
This is the shape of 90% of the analytical pandas code I write: filter, groupby with named aggs, derived columns via assign, sort, reset. Memorise the skeleton.
A grab-bag of aggregations worth knowing
A few specific aggregations come up often enough to call out by name:
first/last— return the first/last value in each group. Order matters, so usually preceded bydf.sort_values(...). Useful for “the customer’s first order amount” or “the latest status per ticket.”nunique— distinct value count.df.groupby("country")["customer_id"].nunique()gives you the number of distinct customers per country.idxmax/idxmin— the index label of the row with the max/min value. Combined with.loc, this is how you pick the full row containing the per-group max:df.loc[df.groupby("country")["revenue"].idxmax()].quantile—df.groupby("country")["revenue"].quantile(0.95). The 95th percentile per group, etc. With named aggs:p95=("revenue", lambda s: s.quantile(0.95)).agg(list)— collects all values in the group into a Python list, one cell per group. Handy for “what items were in this order?” Slow on big data; fine for human-scale outputs.
Performance notes
- Vectorised aggregations are 10-100x faster than
.applywith a Python function. “sum”, “mean”, “max” etc. dispatch to C; lambdas don’t. If you find yourself reaching for.applyin a hot path, ask first whetheraggplus a creative use oftransformcan do the same job — usually it can. as_index=Falseis cheaper than.reset_index()because pandas doesn’t build the MultiIndex just to throw it away. For pipeline code that always ends in a flat DataFrame, default toas_index=False.observed=Truematters with categorical group keys. By default, groupby on a category column produces a row for every category value, even those with no data. Withobserved=True, only present values appear — which is almost always what you want and notably faster for high-cardinality categories. Pandas 3.0 will flip the default; in the meantime, set it.sort=Falseskips the alphabetical sort of group keys. The defaultsort=Trueis fine for small results but adds noticeable overhead on millions of groups, and you probably want a domain-specific sort at the end of the pipeline anyway.- For really large data, the groupby-then-agg pattern is what DuckDB and Polars do natively and faster. If your DataFrame is more than a few GB, consider doing the aggregation in DuckDB (
duckdb.query("SELECT country, SUM(revenue) FROM df GROUP BY country").to_df()) and reading the small result back into pandas.
What’s next
Lesson 30 is joins — merge, concat, and the index-alignment behaviour that’s saved or ruined many an analysis. Lesson 32 will cover window functions (rolling, expanding, ranking) — the time-series-flavoured cousins of what we did with transform today.
Further reading
- pandas: Group by — the user guide, with named aggregations covered properly.
- pandas: Reshaping and pivot tables — pivot, melt, stack, unstack.
- pandas:
transform— the reference page; the user guide section ontransformis especially worth reading.
See you Friday.