PySpark, from the ground up Lesson 13 / 60

Schemas: explicit vs inferred

When to let Spark infer, when to declare your own, and why production code basically always declares.

A DataFrame is two things bolted together: a distributed bag of rows, and a schema — the column names and types that say what those rows look like. The data without the schema is just opaque bytes. The schema without the data is a contract. Spark needs both, every time.

Today’s lesson is about where that schema comes from. There are exactly two answers — Spark guesses it for you, or you write it down — and the choice between them is one of those small decisions that makes the difference between a notebook that runs in 30 seconds and a production job that runs in 30 minutes.

What’s in a schema

df.schema is a StructType. A StructType is just a list of StructField objects, each with a name, a data type, and a nullable flag (plus an optional metadata dictionary that most code never touches). That’s it. There’s no magic and no hidden state: when Spark plans a query, it walks this object to figure out what columns exist and how to read them.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schemas").getOrCreate()

df = spark.read.csv("orders.csv", header=True, inferSchema=True)
df.printSchema()
# root
#  |-- order_id: integer (nullable = true)
#  |-- customer_id: integer (nullable = true)
#  |-- amount: double (nullable = true)
#  |-- ts: timestamp (nullable = true)

printSchema() gives you the human-readable tree. df.schema returns the actual StructType object — useful when you want to reuse a schema or compare two DataFrames programmatically. df.dtypes returns a list of (name, type_string) tuples, which is the easiest form to copy into a script.

df.dtypes
# [('order_id', 'int'),
#  ('customer_id', 'int'),
#  ('amount', 'double'),
#  ('ts', 'timestamp')]

Three views of the same information. Use whichever fits the moment.

A small but useful detail: schemas are positional in some operations and by-name in others. When you do union, Spark matches columns by position by default, so two DataFrames with the same column names in different orders will silently produce a garbage union. The fix is unionByName. We’ll get to that in the joins lesson, but it’s a reminder that the schema is not a passive label — it’s how Spark identifies columns at every step of the plan.

Inferred schemas: cheap when small, brutal when large

Spark can figure out the schema of a CSV or JSON file by reading it. For CSV, you ask for this with inferSchema=True:

df = spark.read \
    .option("header", True) \
    .option("inferSchema", True) \
    .csv("orders.csv")

What this actually does: Spark reads the entire file once just to look at the values and decide what type each column should be (int? double? string? timestamp?). It keeps the resulting schema, discards everything else it read, and then reads the file again to actually load the data. Two full passes over the input.

For a 10MB exploratory CSV in a notebook, you don’t notice. The whole thing finishes in a second. For a 500GB landing-zone dump on S3, you’ve just doubled your read cost, your read time, and your I/O bill — to learn information you almost certainly already know.

JSON is even worse. The JSON reader has no inferSchema switch at all: unless you pass a schema, Spark always infers one, because a JSON file carries field names but no declared types. With nested or deeply variable JSON, the inference pass can scan billions of records to merge schemas across rows.

The other problem with inference, the one that bites in production: inferred types are guesses, and they can change between runs. Suppose your customer_id column is normally numeric. One day a single row arrives with the value "unknown" — somebody’s null sentinel that leaked through the upstream system. Spark sees one string in a column it would have called int, and silently widens the whole column to string. Your downstream code, which was doing df.customer_id + 1, now blows up. Or worse, doesn’t blow up and produces nonsense.

Schema drift is a real failure mode. Inference makes you vulnerable to it.

There’s also the pure cost of being wrong about types. Inference will happily decide a column is int when really it should be bigint, because every value it saw fit in 32 bits. Then a month later a row with an id over two billion arrives, the column comes back as bigint, and whatever downstream table or cast still expects int breaks or overflows. Or it decides a date column is string because one row had "N/A". Inference is optimistic in a way that production systems don’t reward.

Explicit schemas: the production default

The fix is to declare the schema yourself and pass it in. Then Spark doesn’t guess: it parses each value into the type you specified, and values that don’t fit become null (with mode="PERMISSIVE", the default) or fail the whole read (with mode="FAILFAST").

The verbose form uses StructType and StructField:

from pyspark.sql.types import (
    StructType, StructField,
    IntegerType, DoubleType, StringType, TimestampType,
)

orders_schema = StructType([
    StructField("order_id",    IntegerType(),   nullable=False),
    StructField("customer_id", IntegerType(),   nullable=False),
    StructField("amount",      DoubleType(),    nullable=True),
    StructField("ts",          TimestampType(), nullable=True),
])

df = spark.read \
    .option("header", True) \
    .schema(orders_schema) \
    .csv("orders.csv")

One read pass. Predictable types. If a row has "banana" in the amount column, that single value becomes null; the rest of the row still parses. If you’d rather the job fail loudly when that happens, add .option("mode", "FAILFAST").

The shorter form uses a DDL string. It’s the same information, less ceremony:

orders_schema = "order_id INT, customer_id INT, amount DOUBLE, ts TIMESTAMP"

df = spark.read \
    .option("header", True) \
    .schema(orders_schema) \
    .csv("orders.csv")

I prefer the DDL string for flat schemas. It’s two lines instead of seven, it reads like a CREATE TABLE statement, and it’s diff-friendly — when somebody adds a column, the diff is one line, not five.

For deeply nested data (arrays of structs, maps of structs, etc.), the verbose form is clearer. Use whichever fits.

The nullable flag is mostly documentation

You’ll notice every StructField takes a nullable argument. You might assume that setting it to False makes Spark refuse to read rows where that column is null.

It does not. Or rather, it does sometimes, and not other times, and the behaviour is not what you’d hope.

nullable=False is a hint to the Catalyst optimizer. It tells Spark “you can assume this column is never null when you plan queries against it” — which lets the optimizer skip null-handling code paths. If you actually feed in a null where you promised there wouldn’t be one, you don’t get a clean error. You get undefined behaviour: sometimes the row is silently dropped, sometimes it crashes deep in a code generator, sometimes it sails through.

The takeaway: set nullable=True (the default) on everything unless you’re 100% sure. If you want to enforce non-null, do it explicitly with a filter or a check after the read:

# Count once: each .count() call triggers a separate Spark job.
null_ids = df.filter(df.order_id.isNull()).count()
if null_ids > 0:
    raise ValueError(f"Got {null_ids} rows with null order_id")

That’s a check that actually fires. The nullable=False flag is a contract Spark won’t enforce for you.

When inferring is fine

I’m not saying never infer. There are cases where it’s the right tool:

  • One-off exploration in a notebook. You got a file, you don’t know what’s in it, you want to look. Inference is faster than typing out a schema for data you’re going to throw away in ten minutes.
  • Files small enough that the double-read doesn’t matter. Under a hundred megabytes, on local disk, you’ll spend more time writing the schema than running the second pass.
  • Parquet, ORC, Delta, Avro. These formats carry the schema with the data. Spark reads it from the format’s own metadata (the footer for Parquet and ORC, the header for Avro, the transaction log for Delta) for essentially zero cost. There’s no inference step. Reading Parquet “without a schema” is fundamentally different from reading CSV without a schema. No two-pass penalty, no drift risk.

So the rule is narrower than “always declare.” It’s:

For schema-less text formats (CSV, JSON) in production, always declare. For self-describing binary formats (Parquet, ORC, Delta, Avro), let the file tell you.

This is one of the reasons the data-engineering world moves CSV/JSON to Parquet as soon as it can. The schema becomes a property of the data, not a thing you have to maintain in a separate Python file that drifts out of sync with reality.

Reading what Spark gave you

When you’re handed a DataFrame that came from somewhere else — a join, a function call, a notebook a colleague started — the three inspection methods all have their place:

df.printSchema()        # tree view, for humans
df.schema               # StructType object, for code
df.dtypes               # list of (name, type) tuples, for quick copy-paste

printSchema() is what you’ll use 80% of the time. It’s the diagnostic. When a downstream operation fails with a type error, this is the first thing to check.

df.schema is what you’ll use when writing reusable code. You can save a schema to disk as JSON and load it back:

schema_json = df.schema.json()
# {"fields":[{"name":"order_id","type":"integer","nullable":false,...}], ...}

# Later, somewhere else:
from pyspark.sql.types import StructType
import json
restored = StructType.fromJson(json.loads(schema_json))

This is how schema-registry tools and contract-checking layers work under the hood. Good to know it exists; you won’t write it by hand often.

A pattern worth knowing: when a job’s schema is the contract between two teams, store the schema as a .json file checked into Git, alongside the code that reads the data. Reads load the schema from disk via StructType.fromJson(...). Schema changes become Git diffs that go through code review like any other change. This is how you turn “the schema” from tribal knowledge into something a new hire can find on day one.

Run this on your own machine

Drop this into a notebook or a script. It’ll generate a small CSV, then read it four different ways so you can see the difference yourself.

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField,
    IntegerType, DoubleType, StringType, TimestampType,
)
from pathlib import Path

spark = SparkSession.builder.appName("schemas-demo").getOrCreate()

# Make a tiny CSV with a deliberately tricky row
csv_path = Path("/tmp/orders_demo.csv")
csv_path.write_text(
    "order_id,customer_id,amount,ts\n"
    "1,42,59.00,2026-01-03 10:32:00\n"
    "2,42,29.00,2026-01-04 14:22:00\n"
    "3,17,banana,2026-01-05 09:15:00\n"   # bad amount
    "4,8,149.00,2026-01-06 11:40:00\n"
)

# 1. No schema, no inference: everything is a string
df1 = spark.read.option("header", True).csv(str(csv_path))
print("=== Default (everything string) ===")
df1.printSchema()
df1.show()

# 2. Inferred: Spark guesses, and guesses wrong because of 'banana'
df2 = spark.read.option("header", True).option("inferSchema", True).csv(str(csv_path))
print("=== Inferred ===")
df2.printSchema()
df2.show()
# Notice: amount becomes string, because of one bad row.

# 3. Explicit DDL string: amount is DOUBLE, the bad row becomes null
schema_ddl = "order_id INT, customer_id INT, amount DOUBLE, ts TIMESTAMP"
df3 = spark.read.option("header", True).schema(schema_ddl).csv(str(csv_path))
print("=== Explicit (DDL) ===")
df3.printSchema()
df3.show()

# 4. Explicit StructType: same result, more verbose
schema_obj = StructType([
    StructField("order_id",    IntegerType()),
    StructField("customer_id", IntegerType()),
    StructField("amount",      DoubleType()),
    StructField("ts",          TimestampType()),
])
df4 = spark.read.option("header", True).schema(schema_obj).csv(str(csv_path))
print("=== Explicit (StructType) ===")
df4.printSchema()
df4.show()

Run it. Look at what type amount got in each case. The inferred version widening to string because of one bad row is the entire production argument for declaring schemas, visible in a single line of printSchema() output.

Next lesson: select and filter, the two operations you’ll do thousands of times — including the four different ways to refer to a column, only three of which are safe.


Reference: Apache Spark Python API (https://spark.apache.org/docs/latest/api/python/), retrieved 2026-05-01.
