PySpark, from the ground up Lesson 14 / 60

Select and filter: the two operations you'll do thousands of times

select, where, filter, and the four ways to refer to a column — including the one that breaks when you have spaces in column names.

Almost every PySpark program is built out of two verbs: pick some columns, keep some rows. Everything else — joins, aggregations, windows, UDFs — is decoration on top. So before we get to any of that decoration, we should be very comfortable with the two basic verbs.

In PySpark these are select (pick columns) and filter/where (keep rows). They’re not exotic; if you’ve written SQL, the mental model maps cleanly. The interesting part is how you refer to a column, because PySpark gives you four different ways to do it, and only three of them survive contact with real-world data.

select: projecting columns

select takes any number of column references and returns a new DataFrame with just those columns:

df.select("order_id", "amount").show(5)

You can pass strings, column objects, or expressions. They mix freely:

from pyspark.sql.functions import col, upper

df.select(
    "order_id",
    upper(col("country")).alias("country_code"),
    (col("amount") * 1.22).alias("amount_with_vat"),
).show(5)

The output schema is exactly the columns you asked for, in the order you asked. No surprises.

select is a transformation, not an action — it doesn’t run anything until you trigger it (show, collect, write). All it builds is a lazy plan node that says “when somebody finally executes me, project these columns.”

where and filter: keeping rows

where and filter are aliases. Same method, two names. Both take a Boolean column expression and keep only rows where the expression is true:

df.filter(col("amount") > 100).show(5)
df.where(col("amount") > 100).show(5)   # identical

Pick one and stick with it. I lean toward where because it reads like SQL. Some teams prefer filter because it reads like Python. Neither is wrong; consistency inside one codebase matters more than the choice itself.

You can also pass a SQL string fragment if you want, which can be handy when copy-pasting from a SQL editor:

df.where("amount > 100 AND country = 'IT'").show(5)

This works, runs through the same Catalyst optimizer, and produces the same plan as the column-expression form. Use it when it’s clearly more readable; otherwise stick with column expressions, because they’re checked at parse time and they autocomplete in your IDE.

Four ways to reference a column

Here’s where PySpark gets interesting. Suppose your DataFrame has a column called amount. You can refer to it as:

"amount"            # bare string
df.amount           # attribute access
df["amount"]        # item access
col("amount")       # function call (from pyspark.sql.functions)

All four work in select. They are not equivalent everywhere else. Here’s the cheat sheet:

1. Bare string "amount" — works in select, groupBy, orderBy, drop, dropDuplicates. Doesn’t work where Spark expects a Column object — for example, you can’t write df.where("amount" > 100), because Python evaluates "amount" > 100 as a string-vs-int comparison before Spark ever sees it. That’s a Python error, not a Spark one.

2. Attribute access df.amount — works most of the time, fails in three situations. (a) The column name has a space or punctuation: df.amount paid is a syntax error, df["amount paid"] works. (b) The column name shadows a DataFrame method: df.count doesn’t reference the column count, it references the count() method. Confusing and silent. (c) The column name has weird casing or starts with a digit.

3. Item access df["amount"] — works always, on any column name, no matter how cursed. This is the one I reach for when I’m not sure. It’s slightly noisier than attribute access but it never surprises me.

4. The col() function — works always, and uniquely, works without a DataFrame in scope. This matters in two contexts:

from pyspark.sql.functions import col, when, sum as _sum

# Inside agg(): there's no df.x to write, you must use col()
df.groupBy("country").agg(_sum(col("amount")).alias("total"))

# When building reusable expressions in a function
def positive_amount():
    return col("amount") > 0

df.where(positive_amount())

You’ll use col() constantly once you start writing functions that operate on columns generically. Get used to it.

The recommendation, if you want one: use col() in expressions, df["x"] when you need to be unambiguous about which DataFrame, and bare strings only in column-list arguments like select and groupBy. Avoid df.x in shared code because the failure modes are silent.

Boolean operators: the precedence trap

PySpark’s column expressions overload Python’s bitwise operators (&, |, ~) for Boolean logic. They are not the same as and, or, not. This is the single most common PySpark beginner mistake.

# WRONG — Python's `and` doesn't know what a Column is
df.where(col("amount") > 100 and col("country") == "IT")
# Raises: ValueError: Cannot convert column into bool

# RIGHT
df.where((col("amount") > 100) & (col("country") == "IT"))

And you absolutely need the parentheses around each comparison. Why? Because & and | have higher precedence than >, ==, <, etc. — in Python’s grammar, & binds tighter than the comparison operators. Without parens, Python parses your expression in the wrong order and you get a confusing error or, worse, a silently wrong filter.

# WRONG — parsed as col("amount") > (100 & col("country")) == "IT"
df.where(col("amount") > 100 & col("country") == "IT")

# RIGHT
df.where((col("amount") > 100) & (col("country") == "IT"))

Train your fingers to type the parens automatically. Every comparison in a compound filter wraps in parens. No exceptions.

~ is negation:

df.where(~(col("country") == "NL"))
# same as
df.where(col("country") != "NL")

Either form works. The ~ form composes nicely when you’re negating a longer expression, like ~col("email").isNull().

selectExpr: when SQL is shorter

There’s a sibling method, selectExpr, that takes SQL string fragments instead of Column expressions. It’s just select with a SQL parser in front:

df.selectExpr(
    "order_id",
    "amount * 1.22 AS amount_with_vat",
    "CASE WHEN amount > 100 THEN 'big' ELSE 'small' END AS bucket",
)

When an expression is short and obviously SQL-shaped, selectExpr is more compact than the equivalent col(...).cast(...) chain. When the expression is being built programmatically, the Column-expression form is better because you can compose it. Choose per situation; both compile to the same plan.

String matching

Four patterns, in order from cheapest to most general:

df.where(col("email").startswith("anne"))   # prefix
df.where(col("email").endswith("@gmail.com")) # suffix
df.where(col("email").contains("anne"))      # substring
df.where(col("name").like("anne %"))         # SQL LIKE — % and _ wildcards
df.where(col("name").rlike("^[Aa]nne"))      # regex (Java syntax)

startswith and endswith are the cheapest because they can sometimes use partition pruning or sort-based skipping. contains is a substring search. like follows SQL’s wildcard rules: % matches any sequence, _ matches one character. rlike is full regex — Java’s regex flavor, which is 95% the same as Python’s re but not identical. Watch out for \\ vs \ escaping when you put a regex in a Python string.

Null filters

isNull() and isNotNull() are methods on a column. The same three-valued-logic rules from SQL apply: a comparison involving null returns null, not true or false, so where(col("x") == None) doesn’t do what you think.

df.where(col("email").isNull()).count()      # rows with no email
df.where(col("email").isNotNull()).count()   # rows with email

# WRONG — None comparison; never matches
df.where(col("email") == None)

Same trap, different language. SQL has = NULL returning null. PySpark has == None returning null. Same fix: use the dedicated null methods.

There’s also eqNullSafe (sometimes written as the <=> operator in Spark SQL), which treats null as a comparable value: null eqNullSafe null is true, null eqNullSafe 1 is false. Useful for join conditions where you want nulls to match nulls — rare in practice, but the only time == doesn’t do what you want and there’s a clean alternative.

Chaining filters

You can chain filters as much as you want. Each one is its own plan node, but Catalyst will collapse them into a single physical filter at execution time:

df.filter(col("country") == "IT") \
  .filter(col("amount") > 100) \
  .filter(col("ts") >= "2026-01-01") \
  .show()

The chained form and the compound form ((country == "IT") & (amount > 100) & (ts >= "...")) produce identical physical plans. Pick whichever reads better. I tend to chain when each filter is a separate “rule” with its own meaning (“Italian customers”, “high-value orders”, “this quarter”) and to use the compound form when it’s one logical condition with multiple parts.

Putting it together

Realistic example: from an orders DataFrame, get the order ID, customer ID, amount, and country code, for orders from Italy, Spain, or France in the first quarter, where the amount is over 50 and the email is not null.

from pyspark.sql.functions import col

result = (
    orders
    .where(col("country").isin("IT", "ES", "FR"))
    .where(col("amount") > 50)
    .where(col("email").isNotNull())
    .where((col("ts") >= "2026-01-01") & (col("ts") < "2026-04-01"))
    .select(
        col("order_id"),
        col("customer_id"),
        col("amount").alias("amount_eur"),
        col("country"),
    )
)

result.show(20)

isin is the equivalent of SQL’s IN. Takes any number of literal values. Don’t try to pass a Python list with * unpacking unless you actually need to — isin("IT", "ES", "FR") is fine; isin(*countries_list) works too if countries_list is dynamic.

Notice the order of operations doesn’t matter for correctness — Catalyst will reorder filters during planning to put the cheapest, most selective ones first, and to push them down to the data source where possible (we’ll get to predicate pushdown in a later lesson). Write your filters in the order that’s clearest to read. The optimizer doesn’t care.

One more practical note: select and filter are both narrow transformations, meaning they don’t shuffle data between executors. They’re cheap. You can scatter them through your pipeline freely, and the optimizer will collapse adjacent ones into a single physical step. This is the opposite of operations like groupBy and join, which we’ll get to in the next few lessons — those do shuffle, and where you put them matters a lot.

Run this on your own machine

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("select-filter-demo").getOrCreate()

data = [
    (1, 42, 59.00,  "IT", "anne@example.com",   "2026-01-03 10:32:00"),
    (2, 42, 29.00,  "IT",  None,                "2026-01-04 14:22:00"),
    (3, 17, 149.00, "DE", "bob@example.com",    "2026-02-05 09:15:00"),
    (4, 17, 199.00, "ES", "bob@example.com",    "2026-02-22 11:40:00"),
    (5,  8,  42.42, "RO",  None,                "2026-03-28 08:12:00"),
    (6, 99,  12.00, "IT", "claire@example.com", "2026-04-15 16:00:00"),
]
schema = "order_id INT, customer_id INT, amount DOUBLE, country STRING, email STRING, ts TIMESTAMP"
orders = spark.createDataFrame(data, schema)

# Q1: simple filter, simple select
orders.where(col("country") == "IT").select("order_id", "amount").show()

# Q2: compound filter — note the parens around each comparison
orders.where(
    (col("amount") > 50) & (col("country").isin("IT", "ES", "FR"))
).show()

# Q3: null-safe filter
orders.where(col("email").isNotNull()).select("order_id", "email").show()

# Q4: the four column-reference styles, all equivalent here
orders.select("amount").show(2)
orders.select(orders.amount).show(2)
orders.select(orders["amount"]).show(2)
orders.select(col("amount")).show(2)

# Q5: a string-pattern filter
orders.where(col("email").rlike("^[ab]")).show()

# Q6: deliberate trap — try removing the parens around the comparisons
# orders.where(col("amount") > 50 & col("country") == "IT")  # FAILS
orders.where((col("amount") > 50) & (col("country") == "IT")).show()

Run each query. Try removing the parens in Q6. Read the error. That error will save you an hour next time you see it.

Next lesson: adding columns with withColumn, the lit helper, and the chaining trap that turns a 50-step pipeline into a tangled execution plan.


Reference: Apache Spark Python API (https://spark.apache.org/docs/latest/api/python/), retrieved 2026-05-01.

Search