PySpark, from the ground up Lesson 3 / 60

What Spark is and why it replaced Hadoop MapReduce

Matei Zaharia's 2010 paper, in-memory execution, the DAG, lazy evaluation, and the 100x-faster claim — what it really means and what it doesn't.

We finished lesson 2 with MapReduce reaching the end of its useful life around 2014, choking on disk I/O and architectural assumptions from the previous decade. Today we look at the system that replaced it: Apache Spark. By the end of this lesson you’ll know where Spark came from, what it does that Hadoop MapReduce couldn’t, what the famous “100x faster” claim actually means, and — equally important — what Spark is not, because half the confusion about Spark in industry comes from people thinking it’s a database, or a storage system, or a scheduler, when it is none of those things.

Where it came from

Spark was born in 2009 at the AMPLab at UC Berkeley, a research group focused on algorithms, machines, and people (which is what the A, M, and P stood for). The lead author was a PhD student named Matei Zaharia, working under professors Scott Shenker and Ion Stoica. Zaharia had been working at Facebook and had watched the iterative-machine-learning problem of MapReduce up close — researchers running the same algorithm over the same dataset, paying the full disk-I/O cost on every iteration. He set out to fix exactly that.

The first public release of Spark was in 2010. The foundational paper, “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” was published at NSDI in 2012. Zaharia and his collaborators founded Databricks in 2013, which is now the dominant commercial vendor of Spark and is also, as of late 2026, the company that effectively drives most of Spark’s roadmap. The project was donated to the Apache Software Foundation in mid-2013 and graduated to a top-level Apache project in February 2014. Spark 1.0 shipped in May 2014. Spark 2.0 in 2016 unified the DataFrame and Dataset APIs and introduced SparkSession, giving us the modern shape of the DataFrame API (DataFrames themselves first appeared in Spark 1.3 in 2015). Spark 3.0 in 2020 brought adaptive query execution, which is one of those features that quietly makes everyone’s queries faster without them having to do anything. Spark 4.0 landed in 2025. Major versions have arrived every two to five years, and the core abstractions have been remarkably stable throughout.

You will see all of this credited to “Databricks” in marketing material and to “Apache” in technical material. Both are correct. Spark is an Apache project; Databricks is the commercial entity built around it; the same handful of people sit at the centre of both.

The three innovations

Spark’s pitch in the original paper is, distilled: take MapReduce, but make it not slow. The way it did this came down to three architectural choices, each of which addressed one of the specific MapReduce pain points from lesson 2.

Innovation one: an in-memory dataset abstraction. This is the RDD, the Resilient Distributed Dataset, the thing Zaharia’s paper is named after. An RDD is a logical collection of records spread across the cluster, just like a MapReduce intermediate dataset, except that the framework can keep it in RAM between operations instead of writing it to HDFS every single time. The “Resilient” in the name refers to fault tolerance: each RDD remembers how it was derived from its parent (its lineage), so if a partition is lost because a worker died, Spark can recompute just that partition from the lineage rather than restarting the whole job. The upshot is that you get the fault tolerance of MapReduce without the disk I/O. For iterative workloads, where you reuse the same dataset across many passes, this is the difference between a job that takes 30 minutes and a job that takes 30 seconds.
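Lineage is easier to believe once you have seen it recompute something. What follows is a toy model in plain Python, not Spark’s API: the ToyRDD class, its cache dictionary, and the computed log are all invented for illustration. Each derived dataset keeps a pointer to its parent and the function that produced it, which is enough to rebuild a single lost partition.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a list of partitions plus lineage (hypothetical, not Spark)."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent   # the dataset this one was derived from (its lineage)
        self.fn = fn           # per-record function applied to the parent
        self.computed = []     # log of partition indexes we had to (re)build
        if partitions is not None:
            # Source data: assumed to live in durable storage, so always available.
            self.cache = dict(enumerate(partitions))
            self.num_partitions = len(partitions)
        else:
            self.cache = {}    # in-memory partitions, filled on demand
            self.num_partitions = parent.num_partitions

    def map(self, fn):
        # Builds a new dataset that only remembers how to derive itself.
        return ToyRDD(parent=self, fn=fn)

    def compute(self, i):
        """Return partition i, rebuilding it from the parent if it is not in memory."""
        if i not in self.cache:
            self.computed.append(i)
            self.cache[i] = [self.fn(x) for x in self.parent.compute(i)]
        return self.cache[i]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


source = ToyRDD(partitions=[[1, 2], [3, 4], [5, 6]])
squared = source.map(lambda x: x * x)

print(squared.collect())    # [1, 4, 9, 16, 25, 36]
del squared.cache[1]        # simulate the worker holding partition 1 dying
print(squared.compute(1))   # [9, 16]
print(squared.computed)     # [0, 1, 2, 1]
```

The last line is the point: after the simulated failure only partition 1 was rebuilt, while partitions 0 and 2 stayed in memory untouched.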

We don’t write much RDD code anymore — we use the higher-level DataFrame API, which we’ll meet in lesson 5 — but every DataFrame operation still compiles down to RDDs underneath. The abstraction is foundational and will outlive the API on top of it.

Innovation two: a DAG of transformations, not a fixed two-stage model. MapReduce forced every job into exactly one map phase plus one reduce phase, so anything more complex had to be split across several jobs. Spark lets you chain arbitrarily many transformations into a single job, and the engine builds a DAG (a directed acyclic graph) representing the full computation. Filters, maps, joins, group-bys, projections, aggregates, sorts: any combination, in any order, all in one logical job. The Spark engine analyses the full DAG, figures out which operations can be fused together (for instance, two consecutive filter calls can be merged into one), where shuffles are actually required (joins, group-bys, and sorts; not maps and filters), and produces a physical plan that runs only the shuffles it absolutely needs.

This is a much, much bigger deal than it sounds. In MapReduce, a six-step pipeline meant six distinct jobs, six full disk round-trips, and six rounds of JVM startup overhead. In Spark, a six-step pipeline is one job, with maybe two shuffles in the middle of it, all of the in-between work happening in memory. You write the same logical query and it runs an order of magnitude faster, with no extra effort on your part beyond using the higher-level API.
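To make the six-step example concrete, here is a rough sketch of how a pipeline gets cut into stages at shuffle boundaries. It is a simplification in plain Python, not what Spark’s scheduler literally does: the WIDE set and the split_into_stages function are names invented for this illustration, and real Spark can sometimes avoid a shuffle that this sketch would count.

```python
# Operations that need data moved between machines (a shuffle).
# Simplified on purpose: everything else is treated as shuffle-free.
WIDE = {"join", "groupBy", "sort"}


def split_into_stages(pipeline):
    """Cut a list of operation names into stages, starting a new stage at each wide operation."""
    stages, current = [], []
    for op in pipeline:
        if op in WIDE and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


pipeline = ["filter", "map", "join", "filter", "groupBy", "map"]
stages = split_into_stages(pipeline)
shuffles = sum(op in WIDE for op in pipeline)

print(stages)         # [['filter', 'map'], ['join', 'filter'], ['groupBy', 'map']]
print(shuffles)       # 2
print(len(pipeline))  # 6
```

Six logical steps, three stages, two shuffles, one job. Under the MapReduce model from lesson 2, the same six steps would have been six separate jobs.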

Innovation three: lazy evaluation. Most operations in Spark don’t actually do anything when you call them. df.filter(...), df.select(...), df.join(...): none of these execute. They build up the DAG, but no work happens. Only when you call an action, such as count(), collect(), write.parquet(...), or show(), does the engine kick in, look at the entire DAG you’ve built, optimise it, and run it.
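We haven’t installed Spark yet, so here is the same idea as a small plain-Python sketch. The LazyFrame class is hypothetical and far simpler than a real DataFrame; its filter and select methods only record a step in a plan, and nothing touches the data until collect() or count() is called.

```python
class LazyFrame:
    """Toy lazy pipeline (hypothetical, not Spark): transformations record steps, actions run them."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan   # tuple of (kind, function) steps, not yet executed

    def filter(self, predicate):
        # Transformation: returns a new LazyFrame with one more recorded step.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, fn):
        # Transformation: same idea, nothing runs here.
        return LazyFrame(self._rows, self._plan + (("select", fn),))

    def collect(self):
        # Action: only now do we walk the data and apply every recorded step.
        out = []
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                out.append(row)
        return out

    def count(self):
        # Action: built on collect() for simplicity.
        return len(self.collect())


calls = []


def is_even(x):
    calls.append(x)   # record every time the predicate actually runs
    return x % 2 == 0


lf = LazyFrame(range(6)).filter(is_even).select(lambda x: x * 10)
print(len(calls))     # 0
print(lf.collect())   # [0, 20, 40]
print(len(calls))     # 6
```

The predicate is never called while the pipeline is being built; all six calls happen inside collect().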

This is what makes the Catalyst optimiser possible. Catalyst is Spark’s query optimiser; it gets to look at your whole computation before any of it runs and rearrange it for performance. It can push filters down to the data source so you read less data off disk in the first place. It can reorder joins so the smaller datasets get joined first. It can recognise that you only need two columns out of a 200-column Parquet file and read just those two columns. None of this is possible in an eagerly-evaluated system, where each operation runs immediately on the result of the previous one. Lazy evaluation is the price of admission for a smart optimiser, and Spark’s optimiser is one of the things that keeps it competitive against newer entrants.
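Here is a deliberately tiny illustration of why seeing the whole plan first matters. It is not how Catalyst is implemented: the run and push_filters_first functions are invented for this sketch, and the rewrite rule assumes the filter does not depend on any column produced by an earlier step (a real optimiser has to check that).

```python
def run(plan, rows):
    """Execute the plan step by step and count how many rows each step touches."""
    touched = 0
    for kind, fn in plan:
        touched += len(rows)
        if kind == "filter":
            rows = [r for r in rows if fn(r)]
        else:  # "map"
            rows = [fn(r) for r in rows]
    return rows, touched


def push_filters_first(plan):
    """Toy rewrite rule: run all filters before the other steps.

    Assumes no filter reads a column created by a map step.
    """
    filters = [step for step in plan if step[0] == "filter"]
    others = [step for step in plan if step[0] != "filter"]
    return filters + others


rows = [
    {"country": "DE", "amount": 10},
    {"country": "FR", "amount": 20},
    {"country": "US", "amount": 30},
    {"country": "US", "amount": 40},
]

# As written by the user: compute a new column for every row, then filter.
plan = [
    ("map", lambda r: {**r, "doubled": r["amount"] * 2}),
    ("filter", lambda r: r["country"] == "DE"),
]

naive_result, naive_touched = run(plan, rows)
fast_result, fast_touched = run(push_filters_first(plan), rows)

print(naive_result == fast_result)   # True
print(naive_touched, fast_touched)   # 8 5
```

Same answer, less work, and the user’s code did not change. An eager system would already have run the map over all four rows before it ever saw the filter.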

The trade-off is that Spark code is sometimes confusing to debug if you’re used to Pandas. You write what looks like ten lines of perfectly reasonable code, none of it runs, you call .show() at the end, and then the entire ten lines execute as a single optimised batch. Spark does catch some mistakes early (a misspelled column name fails as soon as the plan is built), but any error that depends on the data, such as a malformed row or a failing user-defined function, surfaces at the .show() line with a stack trace that doesn’t quite match the source. Module 3 of this course is largely about getting comfortable with this.
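You can reproduce the feeling without Spark, because Python’s own map() and generator expressions are lazy too. This is only an analogy, with made-up sample data: the bad value sits in the first line, but the exception is raised at the line that finally forces evaluation.

```python
rows = ["1", "2", "oops", "4"]      # made-up input with one malformed value

parsed = map(int, rows)             # lazy: nothing is parsed yet, so no error
doubled = (x * 2 for x in parsed)   # lazy: still nothing has run

try:
    result = list(doubled)          # the "action": every step runs here
except ValueError as exc:
    result = None
    error_message = str(exc)
    print("failed at the action:", error_message)
```

The traceback points at list(doubled), not at the map(int, rows) line where the problem was introduced. Spark behaves the same way with data-dependent errors, just with a longer stack trace.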

The “100x faster” claim, and what it actually means

If you’ve read anything at all about Spark, you’ve seen the headline: 100x faster than Hadoop MapReduce. It sat on the homepage of spark.apache.org for years. It was on every Databricks slide deck for most of the 2010s. It is, depending on what you measure, either entirely true or wildly oversold. Both readings deserve attention because both come up in real conversations.

Where the claim is genuinely true. Iterative algorithms — machine learning, graph processing, anything that reuses the same dataset across many passes — really do run roughly 100x faster on Spark than on MapReduce, and the original Berkeley benchmarks were honest about that. The reason is exactly what you’d expect: MapReduce reads and writes the dataset to HDFS on every iteration; Spark reads it once, keeps it in RAM, and reuses it. If your dataset fits in cluster memory and your algorithm makes 20 passes, Spark does roughly 1 disk read where MapReduce does 20, plus saves 20 rounds of JVM startup. You get a one-to-two-order-of-magnitude speedup almost mechanically.
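The mechanics are simple enough to put in a back-of-envelope model. Every number below is invented for illustration and is not a benchmark: the per-pass costs, the function names, and the assumption that the dataset fits in memory are all mine. The model charges MapReduce a full startup, read, compute, and write on every pass, and charges Spark one startup, one read, one final write, and only the compute on each pass.

```python
def mapreduce_seconds(passes, startup, read, compute, write):
    """Each pass is its own job: start JVMs, read from disk, compute, write back."""
    return passes * (startup + read + compute + write)


def spark_seconds(passes, startup, read, compute, write):
    """One job: start once, read once, iterate in memory, write the result once."""
    return startup + read + passes * compute + write


# Invented per-pass costs in seconds, chosen only to show the shape of the curve.
costs = dict(startup=15, read=60, compute=3, write=60)

for passes in (5, 20, 100):
    mr = mapreduce_seconds(passes, **costs)
    sp = spark_seconds(passes, **costs)
    print(passes, mr, sp, round(mr / sp, 1))
```

With these made-up costs the speedup grows from under 5x at 5 passes to over 30x at 100. The exact figures mean nothing; what matters is that the gap widens with every extra pass over the same data, and widens further the cheaper the in-memory compute is relative to the I/O.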

Multi-stage pipelines — six MapReduce jobs chained together with Oozie — also genuinely run far faster on Spark, for the same reason. Spark fuses the six stages into one job, runs them in memory between shuffles, and skips most of the disk and orchestration overhead.

Where the claim is wildly oversold. Single-pass aggregations — read a big file, do a GROUP BY, write the result — are bottlenecked by raw I/O on both Spark and MapReduce. Reading 1 TB off S3 takes about as long whether you’re doing it in Spark or in Hadoop, because the network and the storage are the bottleneck, not the compute. In that scenario you might see Spark be 1.5x to 3x faster, sometimes 5x, but you will not see 100x. Anyone telling you their SELECT COUNT(*) GROUP BY country query got 100x faster by switching from Hadoop to Spark is misremembering, or comparing 2010 hardware to 2024 hardware, or comparing a poorly-tuned Hive job to a well-tuned Spark job, or just repeating the marketing.

The honest summary: Spark is somewhere between modestly faster and dramatically faster than Hadoop MapReduce, depending on workload, and at this point in 2026 the comparison is mostly historical anyway because nobody’s choosing between Spark and MapReduce on a new project.

What Spark is not

This is the part nobody tells you, and it’s the source of approximately half of the confused conversations about Spark I’ve had over the years.

Spark is not a database. It does not store your data. It does not have tables in the way Postgres has tables. The “Hive table” you query through Spark SQL is metadata pointing at files in object storage; Spark itself owns none of that. There is no such thing as a “Spark database” the way there is a “Postgres database.” If you delete your S3 bucket, your tables are gone, and Spark cannot help you.

Spark is not a storage system. It doesn’t replace HDFS, S3, GCS, ADLS, or any other place you keep data. You bring storage; Spark reads from and writes to it. Spark has connectors for dozens of formats (Parquet, ORC, JSON, CSV, Avro, Delta Lake, Iceberg, JDBC, Kafka, and many more), but the storage itself is somebody else’s problem.

Spark is not a cluster manager. It runs on a cluster manager. Your options are YARN (the Hadoop one, still common), Kubernetes (increasingly the default in 2026), Mesos (deprecated in Spark 3.2 in 2021 and removed in Spark 4.0, mentioned only because old StackOverflow answers reference it), or Spark’s own built-in standalone manager (fine for small clusters and developer environments). The cluster manager allocates machines and processes; Spark uses them.

Spark is, specifically, a distributed compute engine. That’s the entire job description. You give it data (from somewhere), you give it a cluster (from somewhere), and it runs your computation across the cluster against the data. Storage and orchestration are separate concerns, intentionally. This is why Spark deployments look like Lego — pick your storage layer, pick your cluster manager, pick your catalogue, pick your table format — and why migrating between, say, on-prem Hadoop and cloud-native S3 + Kubernetes is mostly a configuration exercise, not a rewrite.

The competitive landscape, in one paragraph

In 2026, Spark is the dominant open-source distributed compute engine, full stop. There are competitors and they’re each interesting in their own niches. Apache Flink is genuinely better than Spark at low-latency true-streaming workloads, and if you’re building a real-time fraud detection system that needs sub-second latency you should look at it. Dask is a pure-Python alternative that scales Pandas-style code to clusters and is more pleasant for small-team data science work that doesn’t need cross-language support. Ray is the Python-native distributed compute framework most associated with modern ML and reinforcement-learning workloads. DuckDB is, depending on how you squint, either a single-node analytical database or a serious threat to Spark in the small-to-medium-data range we discussed in lesson 1. Polars is doing the same thing on the in-memory side. All of these are real, all of them have their place, and none of them have displaced Spark for batch and streaming ETL on warehouses bigger than a single fat machine. That’s still Spark’s home turf, and it’s where this course will keep you for the next 57 lessons.

Next lesson: the architecture. Driver, executors, cluster manager, tasks, stages, jobs — all the pieces that have to be in your head before we start writing code. After that we can finally install something.

For further reading, the Apache Spark documentation is the canonical reference, the original RDD paper is the foundational text, and the Spark source on GitHub is genuinely readable if you ever want to know why the engine is doing what it’s doing.
