PySpark, from the ground up Lesson 44 / 60

ORC, Avro, Delta: the alternatives and when each wins

Three format families that aren't Parquet, when each is the right choice, and why Delta has been quietly taking over.

Last lesson we made the case that Parquet is the right default for analytical work. Most teams stop reading there and never investigate the rest. That’s mostly fine — but you’ll occasionally meet pipelines where Parquet isn’t the answer, and you should know why.

Three format families dominate the “not Parquet” space:

  • ORC — Parquet’s older sibling, born in the Hive ecosystem.
  • Avro — A row-oriented format optimized for streaming and schema evolution.
  • Delta Lake / Iceberg / Hudi — Transactional table formats that sit on top of Parquet and add the database semantics it lacks.

Each one wins in a specific situation. This lesson walks through all three so you can recognize when you’re in that situation and act accordingly.

ORC: the Hive native

ORC stands for Optimized Row Columnar, and it’s structurally very similar to Parquet. Files split into stripes (Parquet’s row groups), stripes split into column streams (column chunks), each stream has its own compression, and a footer stores stripe-level statistics. If you squinted at an ORC file’s layout next to a Parquet file’s, you’d struggle to tell them apart.

# Reading is identical to Parquet
df = spark.read.orc("s3://lake/orders_orc/")

# Writing too
(df.write
   .mode("overwrite")
   .option("compression", "zstd")
   .orc("s3://lake/orders_orc/"))
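The stripe-level statistics in the footer are what let a reader skip data without decompressing it. Here is a minimal pure-Python sketch of that idea. The StripeStats class, the amount column, and the numbers are hypothetical stand-ins, not ORC’s actual footer layout or reader API.

```python
from dataclasses import dataclass


@dataclass
class StripeStats:
    """Hypothetical stand-in for the per-stripe min/max kept in a footer."""
    stripe_id: int
    min_amount: float
    max_amount: float


def stripes_to_read(stats, lower_bound):
    """Return ids of stripes that could contain rows with amount > lower_bound."""
    return [s.stripe_id for s in stats if s.max_amount > lower_bound]


footer = [
    StripeStats(0, 0.0, 120.0),
    StripeStats(1, 95.0, 480.0),
    StripeStats(2, 500.0, 999.0),
]

# Filter: amount > 450. Stripe 0 tops out at 120, so it is never read.
print(stripes_to_read(footer, 450.0))  # [1, 2]
```

Parquet row groups can be pruned the same way, which is part of why the two formats perform so similarly.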

The reason both formats exist comes down to history. ORC was built inside Hortonworks for Hive; Parquet was built inside Twitter and Cloudera, originally for Impala. For a few years they had genuinely different strengths — ORC’s predicate pushdown and built-in indexes (a “min/max + bloom filter” concept it had before Parquet caught up) gave it an edge on certain Hive workloads, while Parquet had better cross-language support and a stronger nested type story.

By 2026 those differences have narrowed. Parquet’s vectorized reader is excellent. Parquet got bloom filters. ORC got better Spark integration. The two formats give roughly equivalent performance on most benchmarks, with edge cases swinging either way depending on data shape and codec.

What hasn’t narrowed is ecosystem gravity. Parquet won the broader analytics ecosystem — Pandas, DuckDB, Polars, BigQuery External Tables, Athena, Snowflake’s external file support, dbt. Almost everything reads Parquet first-class. ORC is read everywhere but treated as a second-class citizen in tools that aren’t Hive-adjacent.

The practical guidance: default to Parquet unless you’re in a Hive shop where ORC is the existing standard. If you’re working in a Cloudera or Hortonworks environment with years of ORC tables, keep using ORC — the conversion cost isn’t worth chasing 5% performance. If you’re starting greenfield, pick Parquet and don’t look back.

Avro: when row-oriented is the right shape

Avro is the odd one out in this lineup. It’s row-oriented, not columnar. Records are stored one after another, each record’s fields packed contiguously. Reading column 3 means reading every record in full. Column projection is fake — you read the bytes and discard the ones you don’t want.
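To see why projection costs nothing extra in a columnar file but a full scan in a row-oriented one, here is a toy sketch using fixed-width binary packing. It is a simplification under stated assumptions: real Avro uses variable-length encodings and real Parquet adds compression and metadata, and the record fields here are made up.

```python
import struct

# Three hypothetical records: (order_id, user_id, amount).
records = [(1, 10, 9.5), (2, 11, 20.0), (3, 10, 7.25)]

# Row-oriented: each record's fields packed together, records back to back.
ROW = struct.Struct("<qqd")  # two 64-bit ints + one double = 24 bytes
row_file = b"".join(ROW.pack(*r) for r in records)

# Column-oriented: all values of one column stored together.
n = len(records)
col_file = {
    "order_id": struct.pack(f"<{n}q", *(r[0] for r in records)),
    "user_id": struct.pack(f"<{n}q", *(r[1] for r in records)),
    "amount": struct.pack(f"<{n}d", *(r[2] for r in records)),
}


def amounts_from_rows(buf):
    """Decode every record in full and keep only the third field."""
    values = [fields[2] for fields in ROW.iter_unpack(buf)]
    return values, len(buf)  # bytes touched = the whole file


def amounts_from_columns(cols):
    """Read only the 'amount' column's bytes."""
    buf = cols["amount"]
    return list(struct.unpack(f"<{len(buf) // 8}d", buf)), len(buf)


print(amounts_from_rows(row_file))     # ([9.5, 20.0, 7.25], 72)
print(amounts_from_columns(col_file))  # ([9.5, 20.0, 7.25], 24)
```

Same answer, three times the bytes. With fifty columns instead of three, the gap grows accordingly.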

That sounds like a step backward, and for analytics it is. So why does Avro exist at all?

Two reasons: low-latency append writes and schema evolution.

When you ingest Kafka events one at a time and need to durably store them as they arrive, you can’t buffer 128 MB in memory waiting for a row group to fill. Avro lets you write a single record and flush it. The file format is designed for streaming append. This is why “Kafka topic archived to S3” pipelines almost always land in Avro, then get repacked into Parquet downstream by a batch job.

The schema evolution story is even more important. Avro stores its schema with the data — every Avro file declares its writer’s schema in the header. Readers compare the writer’s schema to their own reader’s schema and resolve differences automatically, following well-defined compatibility rules. You can:

  • Add a field with a default value. Old readers ignore it; new readers see the default for old records.
  • Remove a field that has a default. Old readers see the default; new readers stop reading it.
  • Rename a field via aliases. A reader that declares the old name as an alias of the new one can still read data written under the old name.
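The rules above can be sketched in a few lines of plain Python. This is an illustration of the resolution logic only, assuming records are already decoded into dicts. The resolve function and the field names are hypothetical, and a real Avro library also checks types and handles nested schemas.

```python
_MISSING = object()


def resolve(record, reader_fields):
    """Project a decoded writer record onto the reader's fields."""
    out = {}
    for field in reader_fields:
        # Match by the reader's name first, then by any alias it declares.
        names = [field["name"], *field.get("aliases", [])]
        value = next((record[n] for n in names if n in record), _MISSING)
        if value is _MISSING:
            if "default" not in field:
                raise ValueError(f"no value or default for field {field['name']!r}")
            value = field["default"]
        out[field["name"]] = value
    # Writer fields the reader does not declare are dropped.
    return out


# A record written under an older schema.
old_record = {"order_id": 1, "total": 9.5, "legacy_flag": True}

# The reader's newer schema: 'total' renamed to 'amount', 'currency' added,
# 'legacy_flag' removed.
reader_v2 = [
    {"name": "order_id"},
    {"name": "amount", "aliases": ["total"]},
    {"name": "currency", "default": "USD"},
]

print(resolve(old_record, reader_v2))
# {'order_id': 1, 'amount': 9.5, 'currency': 'USD'}
```

A field that is missing from the data and has no default is the one case that fails, which is why defaults matter so much for compatibility.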

This kind of forward-and-backward compatibility is essential for event streaming, where producers and consumers deploy independently and you can’t coordinate schema changes across teams. The pattern is paired with a Schema Registry (Confluent’s is the canonical implementation) that stores schemas centrally and assigns each one an ID. The Avro records on the wire reference the schema ID, not the full schema, keeping them small.
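For a sense of how small that reference is, here is a sketch of the framing Confluent’s serializers put in front of each Kafka message: one magic byte (0), a 4-byte big-endian schema ID, then the Avro-encoded body. The frame and unframe helpers are hypothetical names, and the payload is an opaque placeholder rather than real Avro bytes.

```python
import struct

MAGIC_BYTE = 0
HEADER = struct.Struct(">bI")  # 1-byte magic + 4-byte big-endian schema ID


def frame(schema_id, avro_payload):
    """Prefix an Avro-encoded body with the 5-byte schema reference."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + avro_payload


def unframe(message):
    """Split a framed message into (schema_id, avro_payload)."""
    magic, schema_id = HEADER.unpack_from(message)
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[HEADER.size:]


# Placeholder body; a real producer would put Avro binary here.
msg = frame(42, b"payload")
print(msg[:5])       # b'\x00\x00\x00\x00*'
print(unframe(msg))  # (42, b'payload')
```

Five bytes of overhead per record, versus a schema that can easily run to kilobytes. The consumer looks up ID 42 in the registry once and caches it.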

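Reading and writing Avro with Spark needs the spark-avro module. It is maintained as part of the Apache Spark project, but it is an external module that is not on the classpath by default, so you usually have to declare it explicitly, with a version that matches your Spark and Scala build: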

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("AvroDemo")
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.5.0")
         .getOrCreate())

# Read
events = spark.read.format("avro").load("s3://kafka-archive/events/")

# Write
(df.write
   .format("avro")
   .mode("append")
   .save("s3://kafka-archive/events/"))

When to reach for Avro:

  • You’re consuming Kafka topics and writing the raw events to long-term storage.
  • Producers and consumers deploy on different schedules and need schema evolution that doesn’t break either side.
  • Your records are typically read in their entirety, not column-projected.

When not to reach for Avro:

  • Analytical queries that touch a few columns out of many. The lack of column pruning will hurt.
  • Batch workloads where you would otherwise pick Parquet for read performance. Nothing there is appended record by record, so Avro’s streaming-write advantage buys you nothing.

A common production architecture: events land in Avro on raw object storage; a batch job runs every hour to repack into partitioned Parquet for analytics. Avro at the edge, Parquet at the warehouse.

Delta Lake: Parquet plus a transaction log

Now the format that’s been quietly eating the world.

Delta Lake isn’t a new file format — it’s a layer that sits on top of Parquet. The data files are still Parquet. What Delta adds is a transaction log: a directory called _delta_log/ next to your data, full of JSON files that record every commit (add this file, remove that file, change this metadata) in order. Every write produces a new log entry. Every read first consults the log to figure out which files are part of the current table version, then reads only those.
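The “consult the log, then read only those files” step can be sketched in plain Python. This is a simplified model, assuming each commit is a list of JSON lines carrying only add and remove actions. Real log entries also carry metadata, protocol, and commit-info actions, and the live_files helper is a hypothetical name.

```python
import json


def live_files(commits, up_to_version=None):
    """Replay commits in order and return the data files in the snapshot."""
    files = set()
    for version, lines in enumerate(commits):
        if up_to_version is not None and version > up_to_version:
            break
        for line in lines:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


commits = [
    # Version 0: initial write of two files.
    ['{"add": {"path": "part-0000.parquet"}}',
     '{"add": {"path": "part-0001.parquet"}}'],
    # Version 1: one file rewritten.
    ['{"remove": {"path": "part-0001.parquet"}}',
     '{"add": {"path": "part-0002.parquet"}}'],
]

print(sorted(live_files(commits)))
# ['part-0000.parquet', 'part-0002.parquet']
print(sorted(live_files(commits, up_to_version=0)))
# ['part-0000.parquet', 'part-0001.parquet']
```

Note that part-0001.parquet is still sitting in storage after version 1. It is simply no longer part of the table, because the log says so.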

That structure unlocks four things Parquet alone can’t give you:

1. Atomic writes. A plain Parquet write to S3 puts files into the directory one at a time, so a reader that lists it mid-write sees partial state. Delta writes the new files first, then atomically commits a log entry that makes them visible. Either the whole write is visible or none of it is.
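To make the “write first, commit last” idea concrete, here is a minimal local-filesystem sketch. It is not Delta’s implementation: the commit function and the temp-file-then-hard-link trick are assumptions chosen for illustration, and object stores need their own put-if-absent mechanism to get the same guarantee.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Publish one log entry atomically; FileExistsError if the version is taken."""
    final_path = os.path.join(log_dir, f"{version:020d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=log_dir, suffix=".tmp")
    try:
        # Write the full entry under a temporary name that readers ignore.
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(json.dumps(a) for a in actions))
        # os.link fails if final_path already exists, so two writers
        # can never both claim the same version number.
        os.link(tmp_path, final_path)
    finally:
        os.unlink(tmp_path)
    return final_path


log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])

try:
    commit(log_dir, 0, [{"add": {"path": "part-0001.parquet"}}])
    conflict = False
except FileExistsError:
    conflict = True

print(conflict)  # True
print(sorted(n for n in os.listdir(log_dir) if n.endswith(".json")))
# ['00000000000000000000.json']
```

The losing writer’s data files may already exist in storage, but since no log entry references them, no reader ever sees them.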

2. UPDATE / DELETE. Parquet files are immutable, so changing rows means rewriting files. Delta automates this: an UPDATE rewrites the affected files and records in the log that the old ones are removed. Readers automatically see the new state.

from delta.tables import DeltaTable

orders = DeltaTable.forPath(spark, "s3://lake/orders_delta/")

# UPDATE rows in place
orders.update(
    condition="status = 'pending' AND age_days > 7",
    set={"status": "'expired'"},
)

# DELETE rows
orders.delete(condition="amount = 0")

3. MERGE INTO (upserts). The killer feature. Combine inserts and updates into one operation, which is exactly what every CDC (change-data-capture) pipeline needs:

updates = spark.read.parquet("s3://staging/orders_changes/")
target = DeltaTable.forPath(spark, "s3://lake/orders_delta/")

(target.alias("t")
   .merge(updates.alias("u"), "t.order_id = u.order_id")
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())

Before Delta, this pattern took a full table rewrite or a complex windowing job. Now it’s one chained call.
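If the semantics of whenMatchedUpdateAll plus whenNotMatchedInsertAll feel abstract, this plain-Python sketch shows the same upsert logic on lists of dicts. It models the result only, not how Delta executes it, and the upsert helper and sample rows are hypothetical.

```python
def upsert(target, updates, key):
    """Matched keys get every column replaced; unmatched keys are inserted."""
    merged = {row[key]: row for row in target}
    for row in updates:
        merged[row[key]] = row
    return list(merged.values())


target = [
    {"order_id": 1, "status": "pending", "amount": 10.0},
    {"order_id": 2, "status": "pending", "amount": 25.0},
]
updates = [
    {"order_id": 2, "status": "paid", "amount": 25.0},     # matches: update
    {"order_id": 3, "status": "pending", "amount": 40.0},  # no match: insert
]

for row in upsert(target, updates, "order_id"):
    print(row)
# {'order_id': 1, 'status': 'pending', 'amount': 10.0}
# {'order_id': 2, 'status': 'paid', 'amount': 25.0}
# {'order_id': 3, 'status': 'pending', 'amount': 40.0}
```

One difference worth knowing: this sketch silently keeps the last update when two source rows share a key, whereas Delta rejects a merge in which several source rows match the same target row. Deduplicate the change set first.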

4. Time travel. Because the log records every version of the table, you can read any past version by version number or timestamp:

path = "s3://lake/orders_delta/"

# As of a timestamp (here, midnight on 2026-04-21)
old = spark.read.format("delta").option("timestampAsOf", "2026-04-21").load(path)

# As of version 47
old = spark.read.format("delta").option("versionAsOf", 47).load(path)

Time travel is invaluable for debugging (“what did the table look like before yesterday’s bad job?”) and for reproducibility (“the model was trained on this exact snapshot”). It’s not free — old files are kept until vacuumed, eating storage — but the operational debugging value is hard to overstate.
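A timestamp has to be turned into a version number before anything can be read. A reasonable mental model, sketched below, is “the latest commit at or before the requested time”. Treat this as an assumption about the lookup rather than Delta’s exact code: the version_as_of helper and the commit times are made up for illustration.

```python
from bisect import bisect_right
from datetime import datetime


def version_as_of(commit_times, ts):
    """Return the index of the last commit made at or before ts."""
    idx = bisect_right(commit_times, ts) - 1
    if idx < 0:
        raise ValueError("timestamp is earlier than the first commit")
    return idx


# Hypothetical commit times for versions 0, 1, and 2.
commit_times = [
    datetime(2026, 4, 19, 3, 0),
    datetime(2026, 4, 20, 3, 0),
    datetime(2026, 4, 22, 3, 0),
]

print(version_as_of(commit_times, datetime(2026, 4, 21)))  # 1
```

This also shows the catch with vacuuming: once the files behind version 1 are deleted, the lookup still resolves but the read fails.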

Reading and writing is straightforward once you’ve added the package:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df.write.format("delta").mode("overwrite").save("s3://lake/orders_delta/")
df = spark.read.format("delta").load("s3://lake/orders_delta/")

Delta started as a Databricks-only thing, was open-sourced in 2019, and by 2026 is a Linux Foundation project with first-class support across Spark, Trino, Flink, and the major cloud warehouses. It’s no longer a vendor-lock-in story.

Iceberg and Hudi: same idea, different bets

Delta isn’t the only transactional table format. Apache Iceberg (originally from Netflix) and Apache Hudi (from Uber) attack the same problem with different design choices.

  • Iceberg has a more layered metadata tree (each snapshot points to a manifest list, which points to manifest files, which in turn list the data files) that scales better to truly enormous tables and supports cleaner schema/partition evolution. AWS, Snowflake, and BigQuery have all bet hard on Iceberg interop.
  • Hudi focuses on streaming upserts at low latency, with a copy-on-write vs merge-on-read tradeoff exposed to users.
  • Delta has the broadest tooling around it (especially Databricks’ optimizations like Z-ORDER and liquid clustering) and the simplest mental model.

For new projects in 2026, the choice is mostly tribal. Databricks shops use Delta. AWS-centric teams trend Iceberg. Streaming-heavy teams sometimes pick Hudi. The good news: all three solve the same core problems, all three sit on top of Parquet, and all three are converging toward similar feature sets. Pick the one your platform supports best and don’t agonize.

The 2026 lay of the land

Putting it all together, here’s how I’d recommend you think about format choice:

Use case                                   | Format
Static analytical data lake, append-mostly | Parquet
Hive-shop with existing tables             | ORC (don’t migrate)
Streaming Kafka archive                    | Avro, repack to Parquet downstream
Mutable lakehouse (UPDATE/DELETE/MERGE)    | Delta (or Iceberg, or Hudi)
Time-travel debugging required             | Delta (or Iceberg)
One-off small file for human reading       | JSON / CSV

The macro trend is clear: plain Parquet data lakes are slowly migrating to Delta or Iceberg, because once you’ve experienced atomic writes and MERGE INTO, you don’t go back. The migration is non-trivial (reorganizing partition layouts, retraining the team, picking up the operational cost of a transaction log), but it pays for itself the first time a 3 AM job dies halfway through and the table is still consistent.

Try this

Write the same DataFrame as Parquet, ORC, Avro, and Delta, then explore each:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("FormatsDemo")
         .master("local[*]")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-avro_2.12:3.5.0,"
                 "io.delta:delta-spark_2.12:3.1.0")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df = spark.range(0, 100_000).select(
    F.col("id").alias("order_id"),
    (F.col("id") % 50).alias("user_id"),
    (F.rand() * 1000).alias("amount"),
    F.lit("pending").alias("status"),
)

df.write.mode("overwrite").parquet("/tmp/demo/orders_parquet")
df.write.mode("overwrite").orc("/tmp/demo/orders_orc")
df.write.mode("overwrite").format("avro").save("/tmp/demo/orders_avro")
df.write.mode("overwrite").format("delta").save("/tmp/demo/orders_delta")

# MERGE INTO with Delta
from delta.tables import DeltaTable

updates = spark.range(0, 100).select(
    F.col("id").alias("order_id"),
    F.lit(0).alias("user_id"),
    F.lit(99.99).alias("amount"),
    F.lit("paid").alias("status"),
)

target = DeltaTable.forPath(spark, "/tmp/demo/orders_delta")
(target.alias("t")
   .merge(updates.alias("u"), "t.order_id = u.order_id")
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())

# How many rows have status = 'paid' now?
print(spark.read.format("delta").load("/tmp/demo/orders_delta")
      .filter("status = 'paid'").count())   # 100

# Time travel — read the version before the merge
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo/orders_delta")
print(v0.filter("status = 'paid'").count())  # 0

Open /tmp/demo/orders_delta/_delta_log/ after running this. You’ll see two .json files: 00000000000000000000.json for the initial write, 00000000000000000001.json for the merge. That’s the transaction log Delta is built around. Open one and read it — it’s just JSON listing files added and removed.

Next lesson, we leave file formats and start on JDBC sources: pulling data from Postgres, MySQL, and SQL Server, including the partitionColumn trick that prevents you from accidentally DDoSing your production database.

A couple of forward references: lesson 47 covers the cloud-storage caveats (S3 listing costs, eventual consistency, and the _SUCCESS flag) that bite when these formats live on object storage instead of HDFS. Lesson 48 dives deeper into schema evolution — the Avro topic gets fuller treatment there, alongside how Parquet and Delta handle column adds and drops.


References: Apache ORC documentation (https://orc.apache.org/docs/), Apache Avro specification (https://avro.apache.org/docs/current/specification/), and Delta Lake documentation (https://docs.delta.io/latest/index.html). Retrieved 2026-05-01.
