We’ve spent ten lessons inside pandas, which is how most production Python data work in 2026 still happens — the ecosystem moat is enormous, scikit-learn wants a DataFrame, every notebook on Kaggle starts with import pandas as pd. But pandas is not the only DataFrame library anymore, and for some kinds of work it’s no longer the best one. Today we meet Polars, the second-generation DataFrame library, and learn enough to read its code, write a pipeline in it, and judge when to reach for it.
Where Polars came from
Polars was started by Ritchie Vink, a Dutch engineer, in 2020. He’d been frustrated with pandas performance on the kind of medium-large data (10-100 GB) that’s common in industry but awkward for pandas — too big to be comfortable, too small to justify Spark. He wrote the core in Rust, with Apache Arrow as the in-memory format, and a Python binding on top. The first 1.0 release was in 2024; by 2026 Polars is on version 1.x, mature, and is the default DataFrame library for a meaningful chunk of new Python data work.
The four design decisions that make Polars different from pandas:
- Rust core, parallel by default. Every operation that can use multiple cores does, with no configuration. On a 16-core laptop, a Polars groupby is roughly 8-15x faster than the same pandas groupby, and most of that comes from the cores.
- Arrow memory format throughout. No NumPy fallback; everything is columnar Arrow. Strings, dates, missing values — all native, all efficient.
- A lazy API with a query optimizer. This is the bit that sets Polars apart from “fast pandas.” More on this in a moment.
- No index. Polars frames are just rectangles of columns. There’s no .set_index, no row labels, no multi-index hierarchy. Everything that pandas does with the index, Polars does with explicit columns. After the first day of withdrawal, this is a relief.
The two APIs: eager and lazy
Polars has two ways of expressing the same computation, and the difference between them is the most important thing to understand about the library.
Eager mode is what pandas does: every operation runs immediately and returns a new DataFrame. It’s natural for exploration in a notebook:
import polars as pl
df = pl.read_csv("sales.csv")
filtered = df.filter(pl.col("amount") > 100)
grouped = filtered.group_by("country").agg(pl.col("amount").sum())
Each line ran. Each line allocated. The execution order is exactly what you typed.
Lazy mode is the one that matters. Instead of running each step, you build a query plan, hand it to Polars, and let the planner optimize before executing:
result = (
    pl.scan_csv("sales.csv")  # scan, not read: no I/O yet
    .filter(pl.col("amount") > 100)
    .group_by("country")
    .agg(pl.col("amount").sum())
    .collect()  # NOW it runs
)
Note scan_csv instead of read_csv and .collect() at the end. Until collect(), nothing executes; Polars has only built a plan. Then the planner looks at the whole plan and rewrites it: it pushes the filter down into the CSV reader (so rows with amount <= 100 are dropped as they are read instead of being loaded into a frame first), it figures out which columns are actually used and reads only those, it picks the best parallel strategy, and then it runs.
This is the same trick Spark and most modern SQL engines use. The mental shift for pandas users is real: in pandas, you optimize by hand-tuning the order of operations. In Polars lazy mode, you write the operations in the order that’s clearest, and the planner reorders them.
The rule of thumb: production code should use lazy. Notebooks and exploration use eager. The transition from one to the other is small (change read_* to scan_* and add .collect() at the end), and on real data the lazy version is often 5-10x faster again than Polars’s already-fast eager mode.
The expression API
The other big shift is how columns are referenced. In pandas, df["x"] gives you a Series; you can pass it around, do math on it, and assign it back. In Polars, you use pl.col("x") inside expressions:
df.with_columns(
    (pl.col("revenue") * 1.22).alias("revenue_with_vat"),
    pl.col("country").str.to_uppercase().alias("country_upper"),
)
pl.col("revenue") is not a column value — it’s an expression that says “the column called revenue, in whatever frame this gets applied to.” It’s a description, not a fetch. That’s what lets the planner reason about the query.
This is also why you write df.with_columns(...) instead of pandas’s df["new"] = .... Polars frames are immutable; every operation returns a new frame. You don’t mutate, you derive.
A few common expression patterns:
# Filter
df.filter(pl.col("age") > 30)
df.filter((pl.col("age") > 30) & (pl.col("country") == "IT"))
# Add columns
df.with_columns([
    (pl.col("a") + pl.col("b")).alias("sum"),
    pl.col("name").str.to_lowercase().alias("name_lower"),
])
# Group and aggregate
df.group_by("country").agg([
    pl.col("revenue").sum().alias("total"),
    pl.col("revenue").mean().alias("avg"),
    pl.col("customer_id").n_unique().alias("customers"),
])
# Sort
df.sort("revenue", descending=True)
# Join (no index, so always explicit on=)
left.join(right, on="customer_id", how="inner")
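Notice every aggregate is named explicitly with .alias(...). Left alone, an expression keeps the name of its input column, so the sum and the mean of revenue would both come out as revenue and collide. With aliases, you say what you want.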
A pandas-to-Polars cheat sheet
The mappings most often needed, side by side:
| Task | pandas | Polars |
|---|---|---|
| Read CSV | pd.read_csv("f.csv") | pl.read_csv("f.csv") / pl.scan_csv(...) |
| Read Parquet | pd.read_parquet(...) | pl.read_parquet(...) / pl.scan_parquet(...) |
| Select columns | df[["a", "b"]] | df.select(["a", "b"]) |
| Filter rows | df[df["a"] > 5] | df.filter(pl.col("a") > 5) |
| Add column | df["c"] = df["a"] + df["b"] | df.with_columns((pl.col("a") + pl.col("b")).alias("c")) |
| Drop column | df.drop(columns=["a"]) | df.drop("a") |
| Group + sum | df.groupby("k")["x"].sum() | df.group_by("k").agg(pl.col("x").sum()) |
| Sort | df.sort_values("x") | df.sort("x") |
| Join | df.merge(other, on="k") | df.join(other, on="k") |
| Rename | df.rename(columns={"a": "b"}) | df.rename({"a": "b"}) |
| Date part | df["d"].dt.year | df.with_columns(pl.col("d").dt.year()) |
| Null fill | df["x"].fillna(0) | df.with_columns(pl.col("x").fill_null(0)) |
| Apply (avoid) | df["x"].apply(f) | df.with_columns(pl.col("x").map_elements(f)) |
| To pandas | — | df.to_pandas() |
| From pandas | — | pl.from_pandas(pdf) |
A note on apply: in Polars it’s map_elements, and the library will often give you a runtime warning when you use it, pointing at a native expression that does the same job. Per-element Python is even more obviously the wrong answer here than in pandas, because Polars expressions cover almost everything you’d want.
Streaming larger-than-memory data
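This is the second big Polars trick. collect(streaming=True) runs the lazy plan in a streaming fashion, processing data in chunks under the hood without ever materializing the whole frame. Newer 1.x releases spell the same request collect(engine="streaming"), so check which form your installed version expects: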
result = (
    pl.scan_parquet("data/year=2025/*.parquet")  # 80 GB across files
    .filter(pl.col("country") == "IT")
    .group_by("month")
    .agg(pl.col("revenue").sum())
    .collect(streaming=True)
)
That’s the shape of a query that would absolutely melt pandas (80 GB on a laptop), and Polars runs it in a few minutes with bounded memory. Not every operation supports streaming yet — joins on huge frames, some window functions — but the common case (filter, group, aggregate) does, and the coverage has been widening with every release.
Working with both libraries
The conversion between Polars and pandas is cheap, because pandas 2.x can hold Arrow-backed columns:
# Polars frame to pandas
pdf = df.to_pandas(use_pyarrow_extension_array=True)
# Pandas frame to Polars
df = pl.from_pandas(pdf)
With use_pyarrow_extension_array=True, the pandas columns are backed by the same Arrow data instead of being converted to NumPy, so for most dtypes the conversion avoids a copy. (Both directions need pyarrow installed.) So the practical pattern in 2026 is: use Polars for the heavy data work (load, filter, aggregate, transform), convert to pandas only when you need to feed sklearn or a plotting library that doesn’t speak Polars yet:
features = (
    pl.scan_parquet("training_data.parquet")
    .filter(pl.col("year") >= 2024)
    .group_by("user_id")
    .agg([
        pl.col("revenue").sum().alias("total_rev"),
        pl.col("session_count").mean().alias("avg_sessions"),
    ])
    .collect()
)
# Hand off to scikit-learn (model and y are assumed to be defined elsewhere,
# with y aligned to the rows of features)
X = features.to_pandas()
model.fit(X.drop(columns=["user_id"]), y)
This is the pragmatic path: Polars for the data engineering, pandas for the model handoff.
When Polars wins, when pandas wins
Polars wins when:
- The data is medium-large (1 GB and up). The Rust core and parallelism dominate.
- The code is a pipeline — load, filter, group, aggregate, write — where the lazy planner can do real work.
- You’re starting fresh and don’t have to integrate with a pandas-only library.
- The data doesn’t fit in memory, and you can use streaming.
Pandas still wins when:
- You need scikit-learn / statsmodels / something old that wants a pandas DataFrame as input.
- You’re doing rough exploration in a notebook on small data; the pandas API is more forgiving for one-off questions.
- The codebase is already pandas and the speedup wouldn’t justify the rewrite. Mixed codebases get confusing.
- A library you depend on (a niche connector, a domain SDK) returns a pandas frame and you don’t want to add the conversion friction everywhere.
In practice my recommendation in 2026 is: new pipelines, Polars; existing pandas codebases, leave alone unless they’re hitting a performance wall; notebook exploration, whichever you think in faster — and increasingly that’s Polars too, once the muscle memory clicks.
A few things that bite pandas users
Four friction points worth flagging, because they’re where a pandas-fluent person spends their first afternoon of Polars confusion:
Column references aren’t strings, they’re expressions. In pandas you can do df.groupby("country")["revenue"].sum() because column names are strings indexing a frame. In Polars, the same operation is df.group_by("country").agg(pl.col("revenue").sum()) — pl.col("revenue") is the unit you operate on, not the string. Once this clicks, every Polars method signature stops looking strange.
No in-place column assignment. Pandas’s df["new_col"] = ... mutates the frame. Polars frames are immutable; you must say df = df.with_columns(...) and reassign. This is annoying for one minute and a relief forever after, because the entire class of “I assigned to a copy by accident” bugs disappears.
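Joins are explicit on on=. In pandas, df.join matches on the index by default, and df.merge without on= falls back to whichever column names the two frames share. Polars has no index and no such fallback, so apart from cross joins you always pass on= (or left_on= / right_on=). Slightly more typing, much less guessing.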
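Null handling is consistent. Every type is nullable, missing values are null (not NaN, NaT, or None depending on dtype), and pl.col("x").fill_null(0) always works the same way. The one thing to remember is that a float NaN is an ordinary value in Polars, not a missing one, so it is handled with fill_nan rather than fill_null. After years of pandas’s NaN-vs-None-vs-NaT mess, this is a small joy.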
A small worked example
The full shape of a Polars script — read, transform, aggregate, write — to give you the rhythm:
import polars as pl
result = (
    pl.scan_parquet("orders/*.parquet")
    .filter(pl.col("status") == "completed")
    .with_columns([
        (pl.col("quantity") * pl.col("unit_price")).alias("revenue"),
        pl.col("created_at").dt.month().alias("month"),
    ])
    .group_by(["country", "month"])
    .agg([
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("order_id").n_unique().alias("orders"),
        pl.col("customer_id").n_unique().alias("customers"),
    ])
    .sort(["country", "month"])
    .collect(streaming=True)
)
result.write_parquet("output/monthly_summary.parquet")
Read it once: scan, filter, derive two columns, group, aggregate three things, sort, run streaming, write. There is no temporary frame, no intermediate variable, no .copy(). The whole computation is one expression, the planner sees it as one expression, and it runs as fast as the disk and your CPUs allow.
What’s next
Lesson 36 closes the module with an end-to-end project — load a real dataset, clean it, explore it, answer a question, write the result — using the patterns from the last twelve lessons. After that, Module 7 is data engineering: how to take an analysis script like the ones we’ve been writing and turn it into something that runs on a schedule, handles failure, and you don’t have to babysit.
Further reading
- Polars user guide — the official guide, with sections on lazy evaluation, expressions, and migration from pandas. Retrieved 2026-05-01.
- Polars Python API reference — the full method list. Retrieved 2026-05-01.
See you Friday for the project.