Python, from the ground up Lesson 28 / 60

Selection: .loc, .iloc, boolean indexing

The three ways to slice a DataFrame, the differences, and the SettingWithCopyWarning everyone gets bitten by.

You have a DataFrame. Now you want a piece of it — a column, a few rows, the rows where some condition holds, the cell at row 7 column 3. Pandas gives you three ways to do this, and they look similar enough that beginners mash them together until something works. The result is code that runs today and breaks next week when the index changes. Today we untangle the three idioms — bracket, .loc, .iloc — and at the end we look at the SettingWithCopyWarning, the most-Googled message in the pandas world.

We’ll work with one example DataFrame throughout:

import pandas as pd

df = pd.DataFrame(
    {
        "country": ["IT", "DE", "FR", "IT", "DE", "FR"],
        "year":    [2024, 2024, 2024, 2025, 2025, 2025],
        "revenue": [120.0, 340.0, 210.0, 145.0, 380.0, 225.0],
        "orders":  [12, 25, 18, 15, 28, 19],
    }
)

Six rows, four columns, default integer index. Plain enough to reason about, real enough to be useful.

The bracket: convenient and ambiguous

The bare df[...] syntax is the one people learn first because it looks like a Python dict. It does four different things depending on what you put inside:

df["revenue"]                   # one column -> Series
df[["country", "revenue"]]      # list of columns -> DataFrame
df[0:3]                         # row slice -> DataFrame (first 3 rows)
df[df["revenue"] > 200]         # boolean mask -> DataFrame (filtered rows)

That’s four different operations sharing one syntax. It’s convenient at the REPL — you don’t have to remember which to use, just type something and pandas figures it out. It’s also ambiguous: if you have a column literally named 0 or a string-indexed DataFrame, the meaning of df[0:3] shifts. For one-off analysis the bracket is fine. For code you’ll re-read in six months, prefer .loc or .iloc, which mean exactly one thing each.

The other reason to avoid the bracket in pipelines is chaining. df[df["revenue"] > 200]["orders"] = 0 looks like it should set orders to zero on the high-revenue rows. It doesn’t, reliably. We’ll come back to why in a minute.

.loc — label-based selection

.loc indexes by label: by the value of the index for rows, by the value of the column name for columns. Two arguments, separated by a comma:

df.loc[row_labels, col_labels]

For our default integer index, the row labels happen to coincide with row positions — df.loc[2] is the third row because its label is 2. That equivalence is the trap: it works until you do df = df.sort_values("revenue") and now df.loc[2] is wherever the row with original label 2 ended up, not the third row. Always think of .loc as “by label,” even when the labels look like positions.

Concrete forms:

df.loc[0]                              # row with label 0 (Series)
df.loc[[0, 2, 4]]                      # rows with these labels (DataFrame)
df.loc[1:3]                            # rows from label 1 to label 3 INCLUSIVE
df.loc[:, "revenue"]                   # all rows, one column (Series)
df.loc[:, ["country", "revenue"]]      # all rows, two columns (DataFrame)
df.loc[0:2, "country":"revenue"]       # rectangular slice, both ends inclusive

Two things to internalise:

Slices are inclusive on both ends. df.loc[1:3] returns labels 1, 2, and 3. This is the opposite of every other Python slice you’ve ever seen, and it’s deliberate: with arbitrary index values (strings, dates), “stop just before this label” is rarely what you want. It’s still surprising the first time.

Boolean masks live on the row axis. This is the production pattern:

mask = df["revenue"] > 200
df.loc[mask, "orders"]                 # orders column for high-revenue rows
df.loc[mask, ["country", "orders"]]    # two columns for high-revenue rows
df.loc[mask, "orders"] = 0             # ASSIGNMENT (this is the safe form)

The mask must be a boolean Series aligned to the row index. .loc accepts it positionally — first argument, row axis — and you can combine it with a column label or list on the second axis. This is the form to memorise; it’s how 80% of real-world filtering should look.

.iloc — position-based selection

.iloc is the other one. Same shape, different rules: it indexes by integer position, not label. Slices behave like normal Python slices (half-open):

df.iloc[0]                  # first row (Series)
df.iloc[-1]                 # last row
df.iloc[0:3]                # first 3 rows (stop is exclusive, like a list)
df.iloc[:, 0]               # first column (Series)
df.iloc[0:3, 0:2]           # first 3 rows, first 2 columns
df.iloc[[0, 2, 4]]          # rows at these positions

Use .iloc when you genuinely mean “the third row” regardless of what its label is — the first row of a sorted view, the last 100 rows, every other row. Use .loc when you mean a specific labelled value, including any boolean mask.

A useful rule: .iloc for positional plumbing (head, tail, sampling), .loc for everything that’s about what’s in the data.

Boolean indexing — the everyday filter

A boolean Series of the same length as the row axis becomes a row filter:

df.loc[df["revenue"] > 200]
df.loc[df["country"] == "IT"]
df.loc[df["country"].isin(["IT", "FR"])]
df.loc[df["revenue"].between(150, 300)]
df.loc[~df["country"].isin(["DE"])]    # ~ is "not"

Combine masks with & (and), | (or), ~ (not). Use parentheses, because & binds tighter than > and you’ll get a confusing error otherwise:

df.loc[(df["revenue"] > 200) & (df["country"] == "DE")]
df.loc[(df["country"] == "IT") | (df["country"] == "FR")]

If you forget the parens, pandas raises a TypeError mentioning bitwise operations on bool. That’s the message; the fix is always the parens.

For NaN, the mask predicates have specific helpers:

df.loc[df["revenue"].isna()]
df.loc[df["revenue"].notna()]

Don’t use df["revenue"] == None or df["revenue"] != float("nan")NaN doesn’t equal anything, including itself. isna/notna is the only correct way.

.query — when masks get noisy

If you’re stacking three or four conditions, the parens get tedious. df.query accepts a SQL-flavoured string:

df.query("revenue > 200 and country == 'DE'")
df.query("country in ['IT', 'FR'] and year == 2025")

A few rules:

  • String literals are single-quoted (because the whole expression is double-quoted).
  • Column names with spaces or dots need backticks: df.query("order count > 5").
  • Reference Python variables with @: threshold = 200; df.query("revenue > @threshold").

.query is a real win on readability for medium-sized filters. For one-condition filters, .loc[mask] is shorter; for many-condition filters, .query reads like the SQL the analyst is trying to write anyway.

Setting values — the right pattern

Reading is half the story. Writing is where the warnings start.

The right pattern is df.loc[mask, col] = value:

df.loc[df["revenue"] > 300, "country"] = "DE-PREMIUM"
df.loc[df["country"] == "IT", "orders"] *= 2
df.loc[df["revenue"].isna(), "revenue"] = 0

Single .loc call, two arguments, assignment. Pandas knows it’s a write, mutates the original DataFrame in place, and there’s no ambiguity about whether you’re modifying a view or a copy.

The wrong pattern — the one that produces the famous warning — is chained indexing:

df[df["revenue"] > 300]["country"] = "DE-PREMIUM"   # DON'T

This calls df[df["revenue"] > 300] (which returns a new filtered DataFrame, possibly a view, possibly a copy) and then assigns to ["country"] on that. Pandas can’t tell if your write is hitting the original df or a temporary copy that’s about to be garbage-collected. Historically, sometimes it worked, sometimes it didn’t, depending on the dtype layout. So pandas threw SettingWithCopyWarning and crossed its fingers.

SettingWithCopyWarning, briefly

The warning text is famously confusing:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer, col_indexer] = value instead.

What it means: you used chained indexing, and pandas isn’t sure if your assignment took. The fix is what the warning literally tells you — collapse the chain into a single .loc[mask, col] = value.

The deep cause: NumPy arrays back DataFrames, and slicing a NumPy array sometimes returns a view (writes propagate to the original) and sometimes returns a copy (writes don’t). Pandas couldn’t tell from inside, so it warned defensively whenever the structure looked dangerous.

The pandas 2.0+ fix: copy-on-write. Pandas now has a copy-on-write mode that resolves the ambiguity by guaranteeing every operation behaves as if it returns a copy: writes always go to the object you wrote to, never silently to its parent. As of pandas 2.2 it’s opt-in via pd.set_option("mode.copy_on_write", True); in pandas 3.0 it becomes the default. In 2026, turning it on at the top of any new project is the right call:

import pandas as pd

pd.set_option("mode.copy_on_write", True)

With CoW on, the chained-indexing pattern simply doesn’t write anywhere useful (it modifies the throwaway copy and the original is untouched), and pandas emits a ChainedAssignmentError instead of a warning — which is much harder to ignore. Either way, the right pattern is unchanged: df.loc[mask, col] = value.

Adding columns

Two equivalent forms:

df["revenue_per_order"] = df["revenue"] / df["orders"]
df.loc[:, "revenue_per_order"] = df["revenue"] / df["orders"]

For unconditional column creation the bracket form is fine and idiomatic — there’s no chain, just a single assignment. For conditional column creation (set a value only on some rows), use .loc:

df["tier"] = "standard"
df.loc[df["revenue"] > 300, "tier"] = "premium"

Or, for many branches, np.select:

import numpy as np

df["tier"] = np.select(
    [df["revenue"] > 300, df["revenue"] > 150],
    ["premium", "mid"],
    default="standard",
)

np.select is the multi-branch if/elif/else for vectorised assignment.

Putting it together

A real filtering pipeline looks something like:

import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.read_parquet("sales.parquet")

# Filter to the rows we care about.
df = df.loc[
    (df["country"].isin(["IT", "DE", "FR"]))
    & (df["year"] == 2025)
    & df["revenue"].notna()
]

# Add a derived column.
df["revenue_per_order"] = df["revenue"] / df["orders"]

# Conditionally adjust some rows.
df.loc[df["country"] == "IT", "revenue"] *= 1.05  # 5% IT correction

# Pick the columns we want for downstream.
result = df.loc[:, ["country", "revenue", "orders", "revenue_per_order"]]

Every operation is one .loc call, no chaining, no warnings. Read it back: filter, add, adjust, project. That’s the whole vocabulary.

What’s next

Lesson 29 is groupby and aggregation — once you can pick the rows you want, the next thing you want is to summarise them by group. We’ll cover named aggregations, the agg/apply/transform distinction (which trips up almost everyone), and pivot tables.

Further reading

See you Tuesday.

Search