Python, from the ground up Lesson 34 / 60

When pandas is slow: chunks, dtypes, and where to look

The five biggest pandas performance levers, the diagnostic loop, and when 'rewrite in Polars' is the right answer.

You’ve written a pandas script. It does the right thing on the test sample. Then you point it at the real file and walk away to make coffee, and when you come back it’s still running. Or worse, it’s running on a colleague’s laptop and they’re politely asking how long this is supposed to take.

This lesson is the diagnostic loop for pandas performance: how to figure out why it’s slow, the five things that are almost always the cause, and the pragmatic ceiling at which the right answer stops being “optimize the pandas code” and starts being “rewrite in Polars or DuckDB.” We saw the dtype side of this in lesson 33 (categories, string[pyarrow], nullable ints); today we put it in the broader frame.

Pandas slowness has five flavors

In my experience there are really only five reasons pandas is slow, and once you know them, every “this script is taking 40 minutes” turns into “ah, it’s flavor 2.” They are, in rough order of how often I see them:

  1. Bad dtypes. A column that should be category is object. A column of integers got bumped to float64 because of one NaN. A 5-million-row string column lives as Python object strings instead of string[pyarrow]. Memory blows up, every operation is slower than it should be, and the GC starts thrashing.
  2. apply() with a Python lambda. This is the big one. df["col"].apply(lambda x: ...) is a Python for loop in disguise — every row crosses the C/Python boundary, the GIL is held, and you’ve thrown away every advantage of using a DataFrame in the first place.
  3. Working on the whole DataFrame when you only need a slice. Loading 200 columns to use 6 of them. Filtering after a groupby instead of before. Reading 10 million rows into memory to compute one number.
  4. Single-threaded operations on data that would parallelize. Pandas, by design, runs on one core. If your laptop has 16 cores and you’re crunching for an hour, fifteen of them are bored.
  5. Too much data for one machine. This is the wall. Pandas needs roughly 5-10x the data size in RAM during operations (intermediate copies, joins, groupby buffers). For 1 GB of data on a 32 GB machine you’re fine. For 50 GB of data on a 64 GB machine you’re not, and no amount of tuning will save you.

The diagnostic loop maps onto these one to one. Let’s run it.

The diagnostic loop

In a Jupyter notebook (or ipython), the workflow is:

%%timeit
result = df.groupby("country")["revenue"].transform(lambda s: s - s.mean())

%%timeit runs the cell several times and gives you a stable wall-clock number. Use it on whichever line is suspect. If the whole script is slow, comment things out until you find the offender; nine times out of ten it’s a single line.

Once you have the slow line, check three things in order:

df.info(memory_usage="deep")

This shows you every column, its dtype, and its memory footprint. Look for:

  • object dtypes that should be category or string[pyarrow].
  • float64 columns that are conceptually integers (probably acquired a NaN somewhere upstream).
  • Any column where the deep memory is wildly out of proportion — a 5 million row string column at 1.5 GB is your bottleneck.

Then look at the operation itself:

  • Is it .apply with a Python lambda? It can almost certainly be vectorized.
  • Is it a groupby over a column that’s still object dtype? Cast to category first; groupby on category is several times faster.
  • Are you copying the frame more than once? Each .copy() and many chained .assign calls allocate.

That triage usually identifies the lever. Now, the five levers themselves.

Lever 1: better dtypes

This is the cheapest win. Cast object string columns with low cardinality to category; cast everything else to string[pyarrow]; turn integer columns that drifted to float back to nullable Int64. A typical script applied to 5 million rows of customer data drops from 4 GB resident memory to 600 MB, and groupbys go 5-10x faster, all from one pass at the top of the script:

df = df.astype({
    "country": "category",
    "currency": "category",
    "tier": "category",
    "customer_id": "string[pyarrow]",
    "name": "string[pyarrow]",
    "order_id": "Int64",
})

Or, much better, declare it on read so you never pay the bad-dtype tax in the first place (dtype= on read_csv, or just use Parquet, which carries dtypes natively).

Lever 2: vectorize

If you have one optimization to learn from this lesson, learn this. apply with a Python lambda is the slowest thing in pandas, and it can almost always be replaced by something twenty to a hundred times faster.

The pattern is: whatever your lambda does, find the equivalent column-level operation and call that. Concrete examples.

Conditional column. People often write:

df["band"] = df["age"].apply(lambda x: "young" if x < 30 else "old")

Replace with np.where or pd.cut:

import numpy as np
df["band"] = np.where(df["age"] < 30, "young", "old")

For multiple bands, pd.cut or np.select:

df["band"] = pd.cut(df["age"], bins=[0, 18, 30, 60, 200], labels=["minor", "young", "adult", "senior"])

Both are vectorized, both are fast, both run in C.

String operations. People write:

df["domain"] = df["email"].apply(lambda s: s.split("@")[1].lower())

Pandas has a string accessor for exactly this:

df["domain"] = df["email"].str.split("@").str[1].str.lower()

The .str accessor maps onto vectorized string kernels. With string[pyarrow] dtype, those kernels run in Arrow’s compiled string code and are roughly ten times faster than the Python equivalent.

Math on multiple columns. People write:

df["score"] = df.apply(lambda r: r["a"] * 0.5 + r["b"] * 0.3 + r["c"] * 0.2, axis=1)

That’s axis=1, the worst form of apply because it iterates row by row through Python. Vectorize:

df["score"] = df["a"] * 0.5 + df["b"] * 0.3 + df["c"] * 0.2

A hundred times faster. There is essentially no excuse for df.apply(..., axis=1) in 2026; if you find yourself reaching for it, stop and think about which column-level op would do the same thing.

The one legitimate place for apply is when the per-row computation is genuinely complex (call an API, run a model, do something the vectorized world can’t express). And even then — batch it, run it once on the unique values, and join back.

Lever 3: chunked reads

If your data doesn’t fit in memory but the operation is per-row or aggregable, stream it. Every pd.read_* accepts chunksize=:

totals: dict[str, float] = {}
for chunk in pd.read_csv("huge.csv", chunksize=500_000, dtype_backend="pyarrow"):
    for country, sub in chunk.groupby("country"):
        totals[country] = totals.get(country, 0) + sub["revenue"].sum()

Each chunk is a regular DataFrame; you process it and move on. Memory stays bounded at one chunk’s size, regardless of file size. This works for sums, counts, top-K, anything aggregable. It does not work for operations that need to see the whole frame — sorting the entire file, computing a global percentile — for those, lever 5.

Lever 4: query() and eval()

For long arithmetic chains, df.eval and df.query use numexpr under the hood, which evaluates the whole expression in C, in parallel, without intermediate allocations:

# Instead of
df["score"] = (df["a"] + df["b"]) * df["c"] - df["d"] / df["e"]

# Use
df = df.eval("score = (a + b) * c - d / e")
# Instead of
result = df[(df["age"] > 30) & (df["country"] == "IT") & (df["revenue"] > 1000)]

# Use
result = df.query("age > 30 and country == 'IT' and revenue > 1000")

The speedup is modest (typically 2-3x) and you lose a little IDE help, so don’t reach for these for short expressions. For chains of five-plus operations on big frames they’re worth it.

Lever 5: leave pandas

Sometimes the right answer isn’t to optimize the pandas code; it’s to admit that the problem has outgrown the tool. The two main exits in 2026 are:

  • Polars for the same in-memory DataFrame work but with a Rust core, multithreading, and a query optimizer. The lazy API can run queries on data that doesn’t fit in memory via streaming. This is the topic of lesson 35.
  • DuckDB for SQL-style analytical queries against Parquet files or pandas frames. duckdb.query("SELECT country, SUM(revenue) FROM 'data/*.parquet' GROUP BY 1").to_df() runs at roughly the speed of a column-store database, on your laptop, with zero setup.

A rough heuristic: if your data fits comfortably in RAM (under, say, 5 GB on a 32 GB machine) and the script runs in under a minute, stick with pandas. If the script takes 10+ minutes and you’ve already pulled levers 1-4, switch. If the data doesn’t fit at all, you don’t have a choice — Polars in streaming mode or DuckDB.

On AI assistants and pandas optimization

This is one of the cases where AI assistants are reliably useful. Paste a slow function into Claude or ChatGPT with the prompt “this is slow, find the bottleneck and rewrite it” and the suggestions are usually right: change apply to np.where, cast this column to category, replace the merge with a join on the index. The model has seen ten thousand pandas Stack Overflow answers and the patterns are mechanical.

Where the assistants are less reliable is the architectural question — when to abandon pandas entirely. They’ll happily optimize a script that should be a Polars query or a DuckDB call, gaining you 3x when a rewrite would gain you 30x. That judgement call is still yours, and the heuristic above is the start of it.

Memory: the wall

The five-to-ten-times rule is worth repeating because it surprises people. A 1 GB CSV becomes a roughly 1.5-3 GB DataFrame in memory (pandas dtypes are wider than CSV bytes, but Arrow-backed strings narrow this). A groupby on it allocates intermediate buffers that double or triple that. A merge with another frame can allocate the cross-product. So:

  • 1 GB data, 32 GB RAM: comfortable.
  • 5 GB data, 32 GB RAM: tight, especially with merges.
  • 10 GB data, 32 GB RAM: you’ll hit swap; performance falls off a cliff.
  • 10 GB data, 64 GB RAM: workable.
  • 50 GB data, any laptop: not pandas.

The wall is real and it’s not a configuration issue — it’s the cost of holding everything in RAM at once. The way out is either a streaming engine (Polars scan_* with collect(streaming=True), DuckDB) or a distributed engine (Spark, Dask), and we’ll see the first of those next lesson.

What’s next

Lesson 35 is the Polars chapter — what it does differently, why the lazy API is the actual point, and how to read it if you already think in pandas. After that, the project lesson, and then Module 6 closes.

Further reading

See you Tuesday.

Search