Python, from the ground up Lesson 33 / 60

Categorical and string types: memory + speed wins

When converting a column to categorical or string[pyarrow] makes the difference between a job that runs and one that runs out of memory.

There’s a moment in every pandas user’s life when a job that worked yesterday on 1 million rows dies today on 50 million. The memory chart climbs, the swap file fills, the kernel kills the process, and you start googling “pandas out of memory.” Nine times out of ten, the problem isn’t that the data is genuinely too big — it’s that the strings are stored as Python objects, one per row, and you’re paying 50 bytes per character for what should be a 4-byte integer code. Fixing the dtypes turns the same dataset from 8 GB into 800 MB.

Today we cover the two dtypes that do most of the work: category and string[pyarrow]. Both are easy to apply, both are huge wins on the right data, and both have one specific situation where they’re the wrong choice.

Why object dtype is expensive

By default, pandas stores string columns as object dtype. Under the hood, that’s a NumPy array of Python object pointers — one pointer per row, each pointing to a separate Python str object on the heap. Each Python string carries about 50 bytes of overhead before its actual character data. A 50-million-row column of, say, country codes ("IT", "US", "DE", etc.) ends up using:

  • 50M × 8 bytes for the pointers = 400 MB
  • 50M × ~55 bytes for the string objects = ~2.75 GB
  • Plus dictionary headers, allocator overhead, fragmentation…

You blow past 3 GB on a column that semantically holds about 200 unique values. There’s no excuse for it once you know the alternatives.

Categorical dtype

The category dtype solves the low-cardinality case. Internally, pandas builds a small dictionary of unique values and stores the data as integer codes pointing into the dictionary. For a column with 200 unique countries and 50 million rows:

  • 200 × ~55 bytes for the dictionary entries = ~11 KB
  • 50M × 1 byte (or 2, if you exceed 256 categories) for the codes = ~50 MB

About 60 MB total instead of 3 GB. The savings are not subtle.

Conversion is a one-liner:

df["country"] = df["country"].astype("category")

The new column behaves like a string column for almost everything — comparisons, equality, .str methods, groupby — but uses a fraction of the memory. As a bonus, groupby on a categorical column is notably faster, because pandas can use the integer codes directly instead of hashing strings.

The places it shines:

  • Country / currency / language codes — small fixed sets, used everywhere.
  • Status / state enums"pending", "shipped", "delivered", "cancelled".
  • Product categories — usually a few hundred at most.
  • Boolean-ish strings that someone insisted on storing as "yes"/"no" instead of True/False.

The trap: high-cardinality columns. If the column has roughly as many unique values as rows — user IDs, transaction hashes, free-text comments, URLs — categorical is worse than object. You pay the dictionary cost without getting any savings, and operations on it are slower because pandas has to look up codes in a dictionary that’s the size of the data anyway. Rule of thumb: if the unique-value count is more than about 50% of the row count, don’t use categorical.

How to check before deciding:

unique_ratio = df["col"].nunique() / len(df)
# < 0.05 → categorical is a clear win
# 0.05 to 0.5 → probably worth it; benchmark
# > 0.5 → leave it alone (or use string[pyarrow] for text)

Ordered categoricals

Sometimes a categorical isn’t just a set, it’s a sequence: T-shirt sizes (S < M < L < XL), star ratings (1 < 2 < 3 < 4 < 5), credit grades (AAA < AA < A < BBB…). Pandas supports this directly:

import pandas as pd

sizes = pd.Categorical(
    ["M", "L", "S", "XL", "M"],
    categories=["S", "M", "L", "XL"],
    ordered=True,
)
df["size"] = sizes

Now comparisons respect the order: df["size"] >= "L" returns [False, True, False, True, False]. Sorting works correctly without you having to define a sort key. min and max mean what you’d expect.

This is the right way to handle ordinal data in pandas. The alternative — defining a separate sort-order column or remembering to pass key= to every sort — is the kind of thing that breaks the moment a colleague writes a query without knowing about it.

string[pyarrow] dtype

For high-cardinality text columns where categorical doesn’t help, the answer is string[pyarrow]. It’s a proper string dtype backed by Apache Arrow’s columnar string storage, and it’s strictly better than object dtype for almost every text use case:

df["description"] = df["description"].astype("string[pyarrow]")

What you get:

  • Less memory — Arrow stores strings as a single contiguous byte buffer with offsets, no per-string Python overhead. 30-50% smaller for typical text.
  • Faster string operations.str.contains, .str.lower, .str.replace and friends run as vectorised C code over the Arrow buffer. On object dtype, every call iterates Python objects one by one.
  • Proper missing valuespd.NA instead of the NaN/None mess that object columns have.

The speed difference on .str.contains over millions of rows is typically 5-10x. If your code does any text matching at scale, this dtype alone justifies the change.

In pandas 3.0+ (when it ships), Arrow-backed strings will be the default — there’s an opt-in flag in 2.x:

import pandas as pd
pd.options.future.infer_string = True
# now read_csv etc. produce string[pyarrow] for string columns

Turning that flag on at the top of new scripts is a no-regret move. You get the future default today, and your code is ready for 3.0 with no changes.

Vectorised string methods

Whichever string dtype you’re on, the .str accessor is the right way to manipulate text. It applies the operation to the whole column at C speed (or near it), instead of iterating in Python:

df["email"] = df["email"].str.lower().str.strip()
df["clean_phone"] = df["phone"].str.replace(r"\D", "", regex=True)
df["domain"] = df["email"].str.split("@").str[1]
df["is_corporate"] = df["email"].str.endswith(("@example.com", "@example.org"))
df["name_first_word"] = df["name"].str.extract(r"^(\S+)")

What you should not do — ever — is:

# DON'T
df["clean"] = df["email"].apply(lambda x: x.lower().strip())

apply with a Python lambda iterates row by row through Python. On a million rows it’s a hundred times slower than .str.lower().str.strip(). The vectorised path also handles missing values cleanly; the lambda explodes on the first None you hit.

The full .str API mirrors most of the Python str methods plus regex helpers (.str.contains, .str.extract, .str.findall). It’s worth ten minutes scrolling through the pandas string-methods reference once, just so you know what’s available before you write a one-off lambda.

Picking the right dtype: a checklist

For each string column in a dataset that’s giving you memory or speed pain:

  1. Compute nunique() / len(df).
  2. Ratio under ~5%? Convert to category. Massive memory win, faster groupbys.
  3. Ratio 5-50% and the column is short codes (IDs that repeat some)? Benchmark both — usually categorical still wins, but not by as much.
  4. High cardinality, free text, URLs, emails? Convert to string[pyarrow]. Won’t shrink as dramatically as categorical, but you get vectorised .str operations and proper NA handling.
  5. Ordinal data (sizes, grades, ratings)? pd.Categorical(..., ordered=True).
  6. Numeric-looking strings ("1234", "56.7")? They aren’t strings, they’re numbers. astype("int64") or astype("float64[pyarrow]") and stop pretending.

Run this once per dataset, not per column-by-column instinct. The win comes from the columns you’re not thinking about.

A real example: a 50M-row log file

A web access log: 50 million rows, columns for timestamp, user ID, URL path, HTTP method, status code, user agent, country code, response time. Naive load:

df = pd.read_parquet("access.parquet")
df.memory_usage(deep=True).sum() / 1e9
# ~ 8.2 GB

Most of it is the user agent string, the URL path, and country code stored as objects. After dtype optimisation:

df = pd.read_parquet("access.parquet")

# Low-cardinality → category
df["country"] = df["country"].astype("category")
df["method"] = df["method"].astype("category")
df["status"] = df["status"].astype("category")

# High-cardinality text → string[pyarrow]
df["user_agent"] = df["user_agent"].astype("string[pyarrow]")
df["path"] = df["path"].astype("string[pyarrow]")

# Numeric
df["response_ms"] = df["response_ms"].astype("int32")  # was int64; max ~30s, easily fits

df.memory_usage(deep=True).sum() / 1e9
# ~ 0.8 GB

A 10x reduction. The same script that swapped to disk and died now runs comfortably on a laptop. And because the new dtypes are also faster — categorical groupbys, vectorised .str.contains for finding bot traffic — the analysis itself runs in a fraction of the time.

This is the difference between “we need a bigger machine” and “we need to read the dtype docs for half an hour.” Almost always, it’s the second.

What’s next

That closes Module 6. Module 7 starts with groupby proper — the apply / transform / aggregate trio, the patterns that come up over and over in analytical code, and the performance corners worth knowing. After two months of pandas you’ll be writing code that is shorter, faster, and a lot more pleasant than what you started with.

Further reading

See you Friday.

Search