A column gets added upstream. A column gets renamed. A type changes from int to bigint. The Tuesday-morning Slack message: “we deployed a new event payload, can you check your pipelines?”
This is the data engineer’s everyday. Schema evolution is the discipline of keeping pipelines working when the data underneath them changes shape. Different formats handle it with very different grace, and the choice matters more than people realize until the day the new column shows up missing from yesterday’s dashboard.
This lesson is about what changes, why each storage format reacts differently, and what production patterns work in 2026.
Three flavors of change
Before talking about formats, the taxonomy. Schema changes come in three categories:
Backward-compatible — the new schema can read old data. Adding an optional column with a default. Widening a type (int to bigint, float to double). New code reads old files and gets nulls or defaults for the new field. This is the safe one.
Forward-compatible: the old schema can read new data. Dropping an optional column is the typical case: old code reads new files and gets null or the default where the field used to be. Less common, but it matters when consumers update on different schedules than producers.
Breaking — neither direction works without data migration. Renaming a column. Changing a type incompatibly (string to int). Removing a non-optional field. These need a coordinated cutover.
The goal of schema evolution machinery is to make backward-compatible changes free, forward-compatible changes possible, and breaking changes impossible to do accidentally.
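One way to make the taxonomy concrete is a small compatibility check. The sketch below is a simplified model, not any library's API: `Field`, `can_read`, and `classify` are hypothetical names, types are plain strings, and the only widenings it knows are the two mentioned above.

```python
from typing import Dict, NamedTuple


class Field(NamedTuple):
    type: str
    has_default: bool  # True if a reader can fill the field when it is absent


# Widenings treated as safe in this sketch (writer type -> reader type).
WIDENINGS = {("int", "bigint"), ("float", "double")}


def can_read(reader: Dict[str, Field], writer: Dict[str, Field]) -> bool:
    """True if code using `reader` can consume data written with `writer`."""
    for name, r in reader.items():
        w = writer.get(name)
        if w is None:
            # The reader wants a field the data lacks: it needs a default.
            if not r.has_default:
                return False
        elif w.type != r.type and (w.type, r.type) not in WIDENINGS:
            return False
    # Fields present only in the writer are ignored by the reader.
    return True


def classify(old: Dict[str, Field], new: Dict[str, Field]) -> str:
    backward = can_read(new, old)  # new schema reads old data
    forward = can_read(old, new)   # old schema reads new data
    if backward and forward:
        return "both"
    if backward:
        return "backward"
    if forward:
        return "forward"
    return "breaking"


v1 = {"order_id": Field("int", False), "amount": Field("float", False)}
added = {**v1, "country": Field("string", True)}
widened = {**v1, "order_id": Field("bigint", False)}
renamed = {"id": Field("int", False), "amount": Field("float", False)}

print(classify(v1, added), classify(v1, widened), classify(v1, renamed))
```

A schema-level check like this sees a rename only as "one field vanished, another appeared", which is why renames need separate handling in every format discussed below.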
Parquet: schema-on-read, format-level evolution
Parquet stores its schema in each file’s footer. When Spark reads a single Parquet file, it reads the file’s schema and uses that.
When Spark reads multiple Parquet files with different schemas, it has a problem. By default, Spark assumes all files in a path share the same schema. It picks one (often the first listed) and uses it for the whole DataFrame. Files with different schemas at best produce wrong results, at worst throw at read time.
There are two layers of help:
mergeSchema
The Parquet reader supports schema merging:
df = (spark.read
    .option("mergeSchema", "true")
    .parquet("s3a://bucket/events/"))
With mergeSchema=true, Spark reads the footer of every file in the input path, computes the union of all schemas, and uses that as the DataFrame schema. Files missing a column return null for that column. The cost is one metadata round-trip per file before any data is read — for a directory of 10000 files, that’s a non-trivial up-front delay.
mergeSchema is great in development and small-scale jobs. In production over a million-file lakehouse, it’s expensive enough that you should be selective.
What Parquet actually unions
Parquet schema merging follows specific rules:
- Adding a column: works. Files without the column return null.
- Renaming a column: does not work. Spark sees two columns; old files have one, new files have the other.
- Type widening (INT32 to INT64, FLOAT to DOUBLE): works in 2026; older Spark versions threw.
- Type narrowing (INT64 to INT32): fails on rows that don’t fit.
- Type changes across categories (string to int): fails outright.
- Nested struct field addition: works.
- Nested struct field rename: fails the same way as top-level rename.
The takeaway: Parquet handles additive evolution well and renames not at all. In a Parquet-only lake, you do not rename columns. You add the new column, dual-write for a transition period, then drop the old one.
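The merge rules above can be modeled in a few lines of plain Python. This is a toy model of the behavior described in this lesson, not Spark's implementation; `merge_types`, `merge_schemas`, and `read_row` are hypothetical helpers, and schemas are simple name-to-type dicts.

```python
from typing import Any, Dict

# Pairs that may be widened when two files disagree on a column's type.
WIDER = {("int32", "int64"): "int64", ("float", "double"): "double"}


def merge_types(a: str, b: str) -> str:
    if a == b:
        return a
    if (a, b) in WIDER:
        return WIDER[(a, b)]
    if (b, a) in WIDER:
        return WIDER[(b, a)]
    raise ValueError(f"cannot merge {a} with {b}")


def merge_schemas(*schemas: Dict[str, str]) -> Dict[str, str]:
    """Union the columns of every file schema, widening types where allowed."""
    merged: Dict[str, str] = {}
    for schema in schemas:
        for name, dtype in schema.items():
            merged[name] = merge_types(merged[name], dtype) if name in merged else dtype
    return merged


def read_row(row: Dict[str, Any], merged: Dict[str, str]) -> Dict[str, Any]:
    """Project a row onto the merged schema; missing columns become None."""
    return {name: row.get(name) for name in merged}


old_file = {"id": "int32", "customer": "string"}
new_file = {"id": "int64", "customer_id": "string"}  # upstream "renamed" customer

merged = merge_schemas(old_file, new_file)
print(merged)
print(read_row({"id": 1, "customer": "alice"}, merged))
```

The rename shows up as two separate columns in the merged schema, each null where the other has data, which is exactly the failure mode the list describes.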
Avro: schema travels with the data
Avro takes a different approach. Each Avro file stores its writer’s schema in the header. When you read, you provide a reader’s schema. The Avro library resolves field-by-field, with explicit rules for missing fields, type promotion, and field ordering.
A field present in the writer’s schema but missing from the reader’s schema is silently dropped. A field present in the reader’s schema but missing from the writer’s schema is filled with the reader-side default value (specified at schema-definition time). A field with the same name and a compatible-but-different type is promoted (int to long, float to double, string to bytes).
Renames are handled by aliases: in the reader’s schema you mark the new field with "aliases": ["old_name"], and Avro recognizes that old data carrying old_name should populate new_name.
This is fundamentally more flexible than Parquet. The cost: row-oriented encoding, so no column pruning and no predicate pushdown, and generally worse compression than a columnar format.
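The resolution rules described above (dropped fields, reader-side defaults, promotion, aliases) can be sketched without any Avro library. The function below is a simplified model under stated assumptions: `resolve` is a hypothetical helper, writer fields are a name-to-type dict, reader fields are dicts shaped like Avro field definitions, and unions and nested records are out of scope.

```python
from typing import Any, Dict, List

# Promotions named in the text (writer type -> reader type).
PROMOTIONS = {("int", "long"), ("float", "double"), ("string", "bytes")}


def resolve(record: Dict[str, Any],
            writer_fields: Dict[str, str],
            reader_fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Project a record written with one schema onto a reader schema."""
    out: Dict[str, Any] = {}
    for field in reader_fields:
        # Match by the reader's name first, then by any alias it declares.
        candidates = [field["name"], *field.get("aliases", [])]
        source = next((n for n in candidates if n in writer_fields), None)
        if source is None:
            if "default" not in field:
                raise ValueError(f"no value or default for {field['name']}")
            out[field["name"]] = field["default"]
            continue
        w_type, r_type = writer_fields[source], field["type"]
        if w_type != r_type and (w_type, r_type) not in PROMOTIONS:
            raise ValueError(f"cannot read {w_type} as {r_type}")
        value = record[source]
        if (w_type, r_type) == ("string", "bytes"):
            value = value.encode("utf-8")
        out[field["name"]] = value
    # Writer-only fields are never copied, so they are dropped silently.
    return out


writer_fields = {"old_name": "string", "count": "int", "debug": "string"}
reader_fields = [
    {"name": "new_name", "type": "string", "aliases": ["old_name"]},
    {"name": "count", "type": "long"},
    {"name": "region", "type": "string", "default": "unknown"},
]
record = {"old_name": "alice", "count": 3, "debug": "x"}

resolved = resolve(record, writer_fields, reader_fields)
print(resolved)
```

The key design point is that the reader schema drives the loop: every output field is one the reader asked for, and the writer schema only says where to find it.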
For Avro at scale you also need a schema registry: a service like Confluent Schema Registry that stores every historical schema version for a topic, assigns each one a numeric ID, and keeps producers and consumers in agreement about what they’re sending. In the Confluent wire format, each message starts with a 5-byte header (a magic byte followed by a 4-byte schema ID); the reader uses the ID to fetch the matching schema from the registry. With a registry, producers can evolve schemas within the compatibility mode you’ve configured (BACKWARD, FORWARD, FULL, NONE), and the registry rejects new versions that violate it.
This is why Avro dominates streaming and event-ingest pipelines in 2026. Kafka topics carry millions of events with schemas that evolve weekly; Avro plus Schema Registry makes that workable. Parquet at the same churn rate is misery.
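To show what the registry framing looks like on the wire, here is a minimal sketch that splits a Confluent-framed message into its schema ID and Avro payload. The function name `split_confluent_frame` is made up for illustration; the layout assumed is the Confluent wire format (one magic byte of 0, then a 4-byte big-endian schema ID, then the Avro-encoded body).

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0


def split_confluent_frame(message: bytes) -> Tuple[int, bytes]:
    """Return (schema_id, avro_payload) from a Confluent-framed message."""
    if len(message) < 5:
        raise ValueError("message too short for a 5-byte header")
    # ">bI" = big-endian, one signed byte, one unsigned 32-bit integer.
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[5:]


frame = b"\x00" + (42).to_bytes(4, "big") + b"avro-bytes"
schema_id, payload = split_confluent_frame(frame)
print(schema_id, payload)
```

A consumer would look up `schema_id` in the registry to get the writer's schema, then decode `payload` against its own reader schema.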
Reading Avro in Spark:
df = (spark.read
    .format("avro")
    .load("s3a://bucket/events/"))
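Or with an explicit reader schema on a Kafka stream. Open-source Spark’s from_avro takes the schema as a JSON string and does not talk to a registry itself; registry-aware decoding comes from a vendor or third-party library, so the snippet below assumes reader_schema_json was fetched from the registry beforehand:
from pyspark.sql import functions as F
from pyspark.sql.avro.functions import from_avro

# In a streaming job pulling from Kafka
events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "...")
    .option("subscribe", "user_events")
    .load())

# Confluent-framed records start with a 5-byte header (magic byte + schema ID).
# Skip it, then decode the remaining Avro bytes with the schema fetched earlier.
payload = F.expr("substring(value, 6)")
decoded = events.select(
    from_avro(payload, reader_schema_json, {"mode": "PERMISSIVE"})
    .alias("event")
)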
Delta and Iceberg: schema in the table log
Delta Lake (the format we covered in lesson 44) and Apache Iceberg take a third approach: the table’s schema is stored in a transaction log alongside the data files. Every change to the schema is a log entry. Old data files are not rewritten when the schema changes; the reader always projects the file to the current schema.
This gives you the best of both worlds:
- Cheap evolution. Adding a column doesn’t touch existing files. The log records the addition and what default value to use when reading old files.
- Strong validation. A write that doesn’t match the current schema fails by default. You opt into evolution with mergeSchema or equivalent options on a per-write basis.
- Time travel. The schema is versioned. Querying the table at an old version uses the schema from that version.
- Renames that work. Iceberg supports column renames natively (it tracks columns by ID, not name). Delta added rename support in column-mapping mode (1.2+) — it’s an opt-in feature because it requires Parquet readers that respect column-mapping metadata.
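The first three properties in the list fall out of one idea: the schema is a versioned log entry, not something baked into files. The toy class below illustrates that idea in memory. It is a sketch only and does not reflect the real Delta or Iceberg log formats; `ToyTable` and its methods are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


class ToyTable:
    """In-memory stand-in for a table whose schema lives in a commit log."""

    def __init__(self) -> None:
        # Each entry: (schema after this commit, rows in the file it added).
        self.log: List[Tuple[List[str], List[Row]]] = []

    def append(self, rows: List[Row], columns: List[str],
               merge_schema: bool = False) -> None:
        current = self.log[-1][0] if self.log else list(columns)
        new_columns = [c for c in columns if c not in current]
        if new_columns and not merge_schema:
            # Strong validation: reject the write, leave the log untouched.
            raise ValueError(f"schema mismatch, unexpected columns: {new_columns}")
        # Cheap evolution: record the new schema; old files are not rewritten.
        self.log.append((current + new_columns, rows))

    def read(self, version: Optional[int] = None) -> List[Row]:
        if version is None:
            version = len(self.log) - 1
        schema = self.log[version][0]
        visible = [row for _, rows in self.log[: version + 1] for row in rows]
        # Project every file onto the schema of the requested version.
        return [{c: row.get(c) for c in schema} for row in visible]


table = ToyTable()
table.append([{"order_id": 1}], ["order_id"])
table.append([{"order_id": 2, "discount_code": "X"}],
             ["order_id", "discount_code"], merge_schema=True)

print(table.read())           # latest schema, old row padded with None
print(table.read(version=0))  # time travel: original schema only
```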
A typical Delta evolution flow:
# Initial write
(orders_v1
    .write
    .format("delta")
    .mode("overwrite")
    .save("s3a://bucket/orders/"))

# A week later, upstream adds a discount_code column
# We want to add it to the table without rewriting history
(orders_v2_with_discount
    .write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .save("s3a://bucket/orders/"))

# Reads after this see the new column. Old files report null for it.
(spark.read.format("delta").load("s3a://bucket/orders/")
    .select("order_id", "discount_code")
    .show())
Without mergeSchema=true, the second write fails with a schema-mismatch error. The opt-in is the safety mechanism. You explicitly say “yes, I’m evolving the schema this time.”
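For changes that a mergeSchema write cannot express, both Delta and Iceberg have explicit DDL:
-- Delta SQL
ALTER TABLE orders ADD COLUMN region STRING AFTER country;
-- Requires column-mapping mode on the table
ALTER TABLE orders RENAME COLUMN customer TO customer_id;
-- In-place type changes are limited to supported widenings and depend on the
-- Delta version and table features; anything else needs a rewrite
ALTER TABLE orders ALTER COLUMN amount TYPE DECIMAL(18,4);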
The renames work because the underlying Parquet keeps a stable column ID, and the readers understand the mapping. This was historically the biggest reason to choose Iceberg or Delta over plain Parquet, and it remains so in 2026.
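The ID-based mapping is easy to see in miniature. In the sketch below, which is an illustration rather than the actual Iceberg or Delta metadata layout, data files key values by an immutable field ID and the table metadata maps IDs to current names; `read` and `rename` are hypothetical helpers.

```python
from typing import Any, Dict, List

# Data files store values keyed by field ID. They never change on rename.
data_file: List[Dict[int, Any]] = [
    {1: 101, 2: "alice"},
    {1: 102, 2: "bob"},
]

# Table metadata maps field IDs to the names readers see today.
schema: Dict[int, str] = {1: "order_id", 2: "customer"}


def read(rows: List[Dict[int, Any]], schema: Dict[int, str]) -> List[Dict[str, Any]]:
    """Resolve each stored field ID to its current column name."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]


def rename(schema: Dict[int, str], old: str, new: str) -> Dict[int, str]:
    """Return new metadata with one column renamed; no data file is touched."""
    if old not in schema.values():
        raise KeyError(old)
    return {fid: (new if name == old else name) for fid, name in schema.items()}


renamed_schema = rename(schema, "customer", "customer_id")
print(read(data_file, renamed_schema))
```

Compare this with name-based resolution, where the same rename would leave old files with a column the reader no longer asks for.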
The medallion pattern: where evolution actually lives
In production, the cleanest schema-evolution story uses different formats at different stages:
Kafka (Avro + registry)
↓
Raw layer: Delta or Parquet, schema mirrors the upstream Avro
↓
Clean layer: Delta, with a stable curated schema controlled by the data team
↓
Serving layer: Delta, denormalized for query patterns
Producers control the Avro schemas they emit. The schema registry enforces compatibility — producers can’t break consumers without going through review. Spark Structured Streaming reads from Kafka, deserializes via the registry, and lands records into a “raw” Delta table. The raw table has a permissive schema that mirrors whatever Avro is current.
The “clean” layer is where the data team takes control. Transformations from raw to clean explicitly project to a stable schema. New columns from upstream show up in raw automatically; they don’t show up in clean until someone updates the transformation. This decouples upstream evolution from downstream stability.
This pattern handles backward-compatible changes with zero pipeline edits (the raw layer absorbs them; the clean layer ignores fields it doesn’t project), and it surfaces breaking changes in code review when someone tries to update the clean-layer projection.
Validation at the boundaries
Inside a typed pipeline, type errors are caught at compile time (or at least at schema-mismatch time). At the boundary — the read of upstream data — types are whatever the source produces, and the source can lie. A few patterns that work:
Explicit cast at the boundary. Don’t trust inferred types. Cast everything as it enters the clean layer:
from pyspark.sql import functions as F
cleaned = (raw
    .select(
        F.col("order_id").cast("long"),
        F.col("amount").cast("decimal(18,2)"),
        F.to_timestamp("created_at").alias("created_at"),
        F.col("country").cast("string"),
    ))
Fail-fast on unexpected nulls. If a field is supposed to be present and isn’t, you want to know now, not in a downstream report:
null_count = cleaned.filter(F.col("order_id").isNull()).count()
if null_count > 0:
    raise RuntimeError(f"{null_count} rows missing order_id")
In a real pipeline this becomes a data quality check, often via a library like Great Expectations or Soda. The principle: validate at the boundary, not after three transformations have already run.
Whitelist the columns you care about. If you .select() a stable list of columns from raw, upstream additions don’t propagate accidentally:
expected = ["order_id", "amount", "created_at", "country"]
cleaned = raw.select(*expected) # Fails loudly if any are missing
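A stable column list also gives you a cheap drift report. The helper below is a sketch with a hypothetical name (`diff_columns`); it compares the expected list against whatever the source currently exposes, so you can log upstream additions and fail with a readable message on removals.

```python
from typing import Iterable, List, Tuple


def diff_columns(expected: Iterable[str],
                 actual: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) column names, preserving input order."""
    expected_list, actual_list = list(expected), list(actual)
    expected_set, actual_set = set(expected_list), set(actual_list)
    missing = [c for c in expected_list if c not in actual_set]
    unexpected = [c for c in actual_list if c not in expected_set]
    return missing, unexpected


expected = ["order_id", "amount", "created_at", "country"]
upstream = ["order_id", "amount", "created_at", "discount_code"]

missing, unexpected = diff_columns(expected, upstream)
print("missing:", missing)
print("unexpected:", unexpected)
```

In a Spark job the second argument would be `raw.columns`; a non-empty `missing` list is a reason to stop, while `unexpected` is usually just worth logging.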
What can still go wrong
A few traps:
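Aggregations that overflow or lose precision. In Spark, a sum over decimal(38,18) is still decimal(38,18), which leaves only 20 digits to the left of the decimal point, so a large enough total overflows. Floating-point columns pick up rounding error as rows accumulate. Before aggregating, cast to a type with enough headroom (for decimals, that can mean a smaller scale to free up integer digits).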
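The decimal headroom is simple arithmetic: a decimal(p, s) has p - s digits available left of the decimal point. The sketch below checks whether a value fits, using Python's decimal module; `integer_digits` and `fits` are hypothetical helpers, and this mirrors only the digit-count rule, not Spark's overflow handling.

```python
from decimal import Decimal


def integer_digits(value: Decimal) -> int:
    """Number of digits left of the decimal point (finite values only)."""
    if not value.is_finite():
        raise ValueError("value must be finite")
    _, digits, exponent = value.as_tuple()
    return max(len(digits) + exponent, 0)


def fits(value: Decimal, precision: int, scale: int) -> bool:
    """True if the integer part of `value` fits in decimal(precision, scale)."""
    return integer_digits(value) <= precision - scale


# decimal(38,18) leaves 38 - 18 = 20 integer digits.
print(fits(Decimal("1e19"), 38, 18))  # 20-digit integer part
print(fits(Decimal("1e20"), 38, 18))  # 21-digit integer part
print(fits(Decimal("1e20"), 38, 6))   # smaller scale frees up integer digits
```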
Avro null defaults that aren’t what you think. In Avro, an optional field is typically ["null", "string"] with null as the default. If you mistype the default in the schema (e.g., empty string instead of null), old records load the wrong default forever.
Renaming through mergeSchema on Delta. mergeSchema only adds columns. If a write swaps the old column name for a new one with mergeSchema=true, Delta treats the new name as a brand-new column: history stays under the old name, new rows land under the new name, and each column is null where the other has data. For a real rename, enable column-mapping mode and use ALTER TABLE ... RENAME COLUMN.
Parquet rename via “ALTER” on Hive metastore. Hive metastore happily renames a column in its metadata. The underlying Parquet files still have the old name. Reads return null for the renamed column. This bug is older than most data engineers and still ships.
Try this
A small evolution-over-time exercise on Delta:
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
    .appName("EvolutionDemo")
    .master("local[*]")
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.2.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

path = "/tmp/delta_orders"

# Day 1: orders with id and amount
day1 = spark.range(10).selectExpr("id as order_id", "id * 1.5 as amount")
(day1.write.format("delta").mode("overwrite").save(path))
print("After day 1:")
spark.read.format("delta").load(path).show()

# Day 2: upstream added a country column
day2 = (spark.range(10, 20)
    .selectExpr("id as order_id",
                "id * 1.5 as amount",
                "'IT' as country"))
(day2.write
    .format("delta")
    .option("mergeSchema", "true")  # opt in to evolution
    .mode("append")
    .save(path))
print("After day 2:")
spark.read.format("delta").load(path).show()
# id 0..9 have country=null, id 10..19 have country='IT'

# Time travel back to before the country column existed
print("Time travel to version 0:")
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
# Schema does not include country
Run this and watch the schema evolve. The history is preserved. A query at version 0 sees the original schema. A query at the latest version sees the new column. No data was rewritten.
This is Module 8 done. Across these eight lessons you’ve covered file formats (Parquet, ORC, Avro, Delta), the JDBC connector for relational sources, cloud object storage, and the schema evolution patterns that hold it all together. Module 9 picks up streaming: Spark Structured Streaming, the watermarking model, exactly-once semantics, and how all of today’s storage formats serve as both source and sink for streaming jobs.
References: Apache Spark documentation, Delta Lake documentation (https://docs.delta.io/), Apache Iceberg documentation (https://iceberg.apache.org/), Confluent Schema Registry documentation. Retrieved 2026-05-01.