PySpark, from the ground up Lesson 5 / 60

RDD, DataFrame, Dataset — three APIs, one engine

Why Spark has three APIs, what each is good at, when to use which, and why DataFrames won the day for almost everyone.

If you read enough Spark blog posts you’ll come away thinking the framework has three completely different ways to do the same thing, that you have to choose carefully, and that picking wrong will cost you. The first part is half-true. The second part is mostly false. The third part is genuinely false in 2026.

In practice almost everyone uses DataFrames almost all of the time, and there’s a very good reason for that. But to know why DataFrames won, you have to know what they replaced and what they offered that the original API didn’t. So let’s walk through the three, in the order they appeared.

A bit of history, because it explains everything

Spark started life at UC Berkeley’s AMPLab around 2009, was open-sourced in 2010, and became an Apache top-level project in 2014. The original abstraction, the one Matei Zaharia’s 2010 paper introduced, was the Resilient Distributed Dataset, the RDD. For the first several years, that was essentially the entire Spark API.

Then in March 2015 Spark 1.3 shipped the DataFrame API. It looked like an R or pandas DataFrame, but distributed: rows, named columns, a schema, SQL-like operations. Underneath it sat the Catalyst optimizer, which had arrived with Spark SQL in Spark 1.0 the year before.

In January 2016, Spark 1.6 added the Dataset API, which was DataFrames plus compile-time type safety in Scala and Java. In Spark 2.0 (July 2016) DataFrame and Dataset were unified at the API level: DataFrame officially became Dataset[Row] in Scala. The conceptual distinction between RDD, DataFrame, and Dataset is still alive and worth knowing.

Three APIs. One engine. Different abstractions over the same execution model.

RDD: the original

An RDD is, at heart, a partitioned collection of arbitrary objects. If you’ve used Python lists, an RDD is “a list, except chunked into partitions, and each chunk lives on a different machine, and you can map and filter it in parallel.”

# This is RDD code. We rarely write this anymore.
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
result = (
    rdd.map(lambda x: x * 2)
       .filter(lambda x: x % 3 == 0)
       .reduce(lambda a, b: a + b)
)

Three things about that snippet are characteristic of RDD code. First, the operations (map, filter, reduce) take arbitrary functions — Python lambdas, in this case. Second, the data type is whatever you put in there: integers, tuples, your own classes, dicts, anything Python can pickle. Third, there’s no schema. The RDD has no idea what’s inside; it just trusts that your lambda x: x * 2 will do something sensible.

That flexibility is also the weakness. Because the operations are opaque Python functions, Spark cannot look inside them. It cannot rewrite map(f).filter(g) into anything cleverer than “apply f to every row, then apply g to every row.” It has no idea whether f reads the whole row or just one column. It can’t push filters down. It can’t optimise joins because it doesn’t know what a “join key” is — to an RDD, a join is just a generic shuffle of (K, V) pairs.
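
To make the "opaque function" point concrete, here is a toy model in plain Python of how the RDD snippet above is evaluated. It is a sketch for intuition only, not Spark's implementation: the helper names parallelize and run are made up for this example, and everything runs in one process. The thing to notice is that the "engine" can only call f and g; it has no way to look inside them or reorder them.

```python
from functools import reduce


def parallelize(data, num_slices):
    """Split data into num_slices roughly equal chunks (stand-ins for partitions)."""
    items = list(data)
    size, extra = divmod(len(items), num_slices)
    partitions, start = [], 0
    for i in range(num_slices):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def run(partitions, f, g, combine):
    """Apply f, then g, to every element of every partition, then reduce.

    f and g are black boxes here: all this function can do is call them,
    in exactly the order the user wrote them.
    """
    partials = []
    for part in partitions:
        kept = [y for y in (f(x) for x in part) if g(y)]
        if kept:
            # Reduce inside the partition first, as each worker would.
            partials.append(reduce(combine, kept))
    # Then combine the per-partition results.
    return reduce(combine, partials)


partitions = parallelize(range(1_000_000), num_slices=8)
result = run(
    partitions,
    lambda x: x * 2,
    lambda x: x % 3 == 0,
    lambda a, b: a + b,
)
print(result)
```

In real Spark each partition would be processed by a task on an executor rather than in a for loop, but the constraint is the same: the functions are called, never inspected.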

What RDDs gave the world:

  • A general parallel-processing primitive. You could use Spark to do things that aren’t tabular at all — graph traversals, ML iterations, string processing, geospatial work.
  • Fault tolerance via lineage. Each RDD remembers how it was derived. If a partition is lost, Spark recomputes just that partition by replaying the recorded transformations from the source.
  • Low-level control. You could choose your own partitioner, custom serialisation, broadcast variables, accumulators, the lot.
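
The lineage idea in the list above can be sketched in a few lines of plain Python. This is a toy model under loud assumptions: ToyRDD and its methods are invented for illustration, there is no cluster, and "losing" a partition just means dropping it from a dictionary. What it does show is the core trick: each dataset stores a recipe for its partitions rather than the data itself, so a lost partition can be rebuilt on demand.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it stores a recipe per partition, not the data."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute      # partition index -> list of records
        self._materialised = {}      # partitions currently held "in memory"

    @classmethod
    def from_chunks(cls, chunks):
        return cls(len(chunks), lambda i: list(chunks[i]))

    def map(self, f):
        # The child closes over its parent: that link is the lineage.
        return ToyRDD(self.num_partitions,
                      lambda i: [f(x) for x in self.partition(i)])

    def filter(self, keep):
        return ToyRDD(self.num_partitions,
                      lambda i: [x for x in self.partition(i) if keep(x)])

    def partition(self, i):
        # Compute on first use, or again after the partition was lost.
        if i not in self._materialised:
            self._materialised[i] = self._compute(i)
        return self._materialised[i]

    def lose(self, i):
        # Simulate a failed executor taking partition i with it.
        self._materialised.pop(i, None)


source = ToyRDD.from_chunks([[1, 2, 3], [4, 5, 6]])
doubled = source.map(lambda x: x * 2)
big = doubled.filter(lambda x: x > 8)

before = big.partition(1)
for rdd in (source, doubled, big):
    rdd.lose(1)              # partition 1 disappears at every level
after = big.partition(1)     # rebuilt by replaying map and filter
print(before, after)
```

Only partition 1 is recomputed; partition 0 is never touched by the recovery.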

When you’d reach for RDDs in 2026:

  • A specific algorithm that needs fine-grained control over partitioning (e.g. a custom partitioner for a domain-specific key distribution).
  • Graph algorithms that don’t fit DataFrames cleanly. Though even here, GraphFrames (DataFrame-based) has eaten most of the use cases.
  • Custom serialisation for objects that don’t fit the DataFrame schema model — say, large binary blobs with unusual access patterns.
  • Legacy code you’re maintaining. There’s a lot of RDD code still running in production from the 2014–2017 era.
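
As an illustration of the first case in the list above, here is a sketch of a custom partition function. The scenario is hypothetical: it assumes the key is a country code and that two countries (the HOT_KEYS below, chosen arbitrarily) are heavy enough to deserve a partition each. The function itself is plain Python, so it can be checked without a cluster.

```python
import zlib

NUM_PARTITIONS = 8
# Hypothetical heavy keys, each pinned to its own partition.
HOT_KEYS = {"IT": 0, "DE": 1}


def country_partitioner(key: str) -> int:
    """Map a country code to a partition index in [0, NUM_PARTITIONS)."""
    if key in HOT_KEYS:
        return HOT_KEYS[key]
    # crc32 is stable across Python processes, unlike the built-in hash()
    # for strings, which is randomised per process unless PYTHONHASHSEED is set.
    spread = NUM_PARTITIONS - len(HOT_KEYS)
    return len(HOT_KEYS) + zlib.crc32(key.encode("utf-8")) % spread


for code in ["IT", "DE", "FR", "ES", "NL"]:
    print(code, country_partitioner(code))
```

In PySpark, a function like this can be passed as the partitionFunc argument of RDD.partitionBy on an RDD of (key, value) pairs, for example pairs.partitionBy(NUM_PARTITIONS, country_partitioner), where pairs is assumed to be keyed by country code.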

For 99% of analytical work — read some files, filter, join, aggregate, write — RDDs are the wrong tool. They’re harder to write, slower to run, and the optimizer can’t help you.

DataFrame: the API that won

A DataFrame is a distributed table with named columns and a schema. It looks like this:

df = spark.read.parquet("s3://runehold/orders/")

(df.filter(df.country == "IT")
   .groupBy("product_id")
   .agg({"amount": "sum"})
   .show())

Three things are different from the RDD code. First, the operations (filter, groupBy, agg) are declarative: they describe what you want, not how to compute it. Second, columns are named: df.country, df.amount. Third, there’s a schema: Spark knows country is a string, amount is a double, and so on.

The reason all of that matters is Catalyst, Spark’s query optimizer. When you write df.filter(df.country == "IT").groupBy("product_id").agg(...), Spark doesn’t actually run those steps in that order. It builds a logical plan, then runs it through Catalyst, which:

  • Pushes the filter down into the data source. If you’re reading Parquet, the filter on country becomes part of the file scan, so Spark can skip row groups whose statistics rule out country = 'IT', and skip whole directories if the data is partitioned by country. On a well-laid-out dataset that can eliminate the vast majority of the I/O.
  • Prunes columns. If you only use country, product_id, and amount, Spark only reads those columns from disk. Parquet’s columnar layout makes this nearly free.
  • Picks the right join strategy. Broadcast hash join when one side is small. Sort-merge join when both are large. Shuffle hash join in specific circumstances. The optimizer makes this call based on stats, not on what you wrote.
  • Reorders operations. Filters before joins. Aggregations before joins where possible. Constant folding. Common subexpression elimination.
  • Generates whole-stage code. This is the magic part. Spark takes the physical plan, generates Java source at runtime, compiles it to bytecode, and in doing so fuses many operations into a single tight loop. A filter -> map -> filter -> sum chain doesn’t run as four steps; it runs as one. This is why DataFrame code on the JVM can get close to hand-written Java doing the same thing.
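
The fusion idea in the last bullet can be shown with a plain-Python analogy. This is not what Spark generates (Spark emits Java, and the rows and the 1.22 multiplier here are made-up sample values); it only contrasts running a filter -> map -> filter -> sum chain as four separate passes with running it as one loop that never builds intermediate collections.

```python
import math

rows = [
    {"country": "IT", "amount": 20.0},
    {"country": "DE", "amount": 50.0},
    {"country": "IT", "amount": 5.0},
    {"country": "IT", "amount": 10.0},
]


def staged(rows):
    """Four passes, each materialising an intermediate list."""
    step1 = [r for r in rows if r["country"] == "IT"]    # filter
    step2 = [r["amount"] * 1.22 for r in step1]          # map
    step3 = [a for a in step2 if a > 10.0]               # filter
    return sum(step3)                                    # sum


def fused(rows):
    """The same logic as one loop with no intermediate lists."""
    total = 0.0
    for r in rows:
        if r["country"] != "IT":
            continue
        a = r["amount"] * 1.22
        if a > 10.0:
            total += a
    return total


print(staged(rows), fused(rows))
print(math.isclose(staged(rows), fused(rows)))
```

The two functions agree up to floating-point rounding; the fused version simply does less bookkeeping per row, which is the effect whole-stage code generation is after.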

The performance gap between RDD and DataFrame for typical analytical work is not subtle. Speed-ups of 5x to 50x in favour of DataFrames are common, and the gap grows with how much room the query leaves the optimizer. A simple count is similar between the two. A complex multi-join with filters and aggregations is dramatically faster as DataFrames.

The Databricks deep-dive on Catalyst (linked at the bottom) is the original technical writeup, and it’s still the clearest explanation if you want to understand exactly how the optimizer works.

DataFrames also speak SQL. Anything you can do with the DataFrame API, you can also do with spark.sql("SELECT ..."). The two are interchangeable — same Catalyst, same execution. Many production codebases mix both freely: DataFrame API for data manipulation, SQL for the actual analytical queries because the SQL is easier for analysts to read.

Dataset: the typed cousin

In Scala (and Java), a Dataset is a typed DataFrame. You define a case class, and the Dataset knows the columns are exactly those fields with those types:

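import org.apache.spark.sql.Dataset
import spark.implicits._   // encoders for .as[Order]; assumes a SparkSession named spark

case class Order(orderId: Long, country: String, amount: Double)

val orders: Dataset[Order] = spark.read.parquet("...").as[Order]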

orders.filter(_.country == "IT")     // typed: _.country is a String
      .map(_.amount * 1.22)          // typed: returns Dataset[Double]

The benefit is compile-time type safety. If you write _.cuontry (typo), the Scala compiler catches it before you even submit the job. If you write _.amount.toUpperCase (nonsense), the compiler catches it. Refactoring a column rename becomes a compile error in every place that needs updating.

The cost is that when you use the typed API (map, filter with lambdas operating on Order), Catalyst can’t see inside the lambda — same problem as RDDs. You get type safety, you lose some optimisation. Most Scala teams use a mix: untyped DataFrame operations where Catalyst can help, typed Dataset operations where the type safety is worth more than the optimisation.

In PySpark, Datasets do not exist. Python is dynamically typed; there’s no compile-time type system to enforce. The PySpark DataFrame is conceptually equivalent to Scala’s Dataset[Row]. When you read PySpark documentation that says “this returns a DataFrame,” and you read Scala documentation that says “this returns a Dataset[Row],” those are the same thing.

This is a relief, actually. PySpark users have one less abstraction to learn. We have DataFrames and RDDs, and that’s it.

What this means for your day-to-day PySpark

Three rules of thumb:

Rule 1: Use DataFrames. This is the API. This is what the docs are written around, what the optimizer is built for, what every PySpark code review will expect. If you find yourself reaching for RDDs, ask why three times before you do.

Rule 2: Use native Spark functions, not Python lambdas, whenever possible. Inside the DataFrame API, there are two ways to express a transformation. The fast way: from pyspark.sql import functions as F and then F.upper(F.col("name")), F.when(...), F.regexp_replace(...). These are JVM-side operations, fully visible to Catalyst, no Python involved. The slow way: a Python UDF (@udf decorator) that processes one row at a time in a Python worker. Every input row has to be serialised out of the JVM, run through the Python function, and the result serialised back, and Catalyst cannot see inside the function. We’ll get into this in the next lesson.

Rule 3: Drop down to RDDs only with a specific reason. “I prefer the syntax” is not a reason. “I need a custom partitioner that no DataFrame primitive can express” is a reason. “We have a 4,000-line legacy module written against RDDs” is a reason. “I’m porting GraphX code” is a reason. Outside of those, write DataFrames.

In a typical Runehold-style data engineering codebase (read Parquet, join with reference data, aggregate, write back) you can go years without touching the RDD API. The only time most teams meet RDDs is when they call .rdd on a DataFrame to inspect partitioning, or when an old StackOverflow answer suggests something that turns out to be 2017-vintage.

Quick comparison

Aspect                    | RDD                               | DataFrame                     | Dataset (Scala)
--------------------------|-----------------------------------|-------------------------------|----------------------------
Abstraction               | Partitioned collection of objects | Distributed table with schema | Typed distributed table
Optimizer                 | None                              | Catalyst                      | Catalyst (untyped ops)
Schema                    | No                                | Yes                           | Yes (compile-time)
Type safety               | Runtime only                      | Runtime only                  | Compile time
Available in PySpark      | Yes                               | Yes                           | No (= DataFrame)
Performance for analytics | Baseline                          | 5–50x faster                  | Like DataFrame
When to use               | Niche / legacy                    | Default                       | Scala teams who want types

That table is your cheat sheet. Tape it to your monitor for the first month and then throw it away — by lesson 20 you’ll have it memorised.

Next lesson: PySpark vs Scala Spark, and what actually crosses the wire when you call .filter() from Python.

References

Retrieved 2026-05-01.
