Spark and modern batch · Narcis Miclaus

Lesson 34 ended with a list of things Hadoop got right and things it got wrong. The most consequential of the wrong things was disk-heavy intermediate state: every shuffle hit disk, every iteration of an iterative job paid the disk tax again, and machine-learning and graph workloads ran orders of magnitude slower than they had any right to. The replacement for MapReduce came out of UC Berkeley’s AMPLab in 2010, generalised in a 2012 paper, became an Apache top-level project in 2013, and by 2018 had eaten almost the entire batch-processing market that Hadoop used to own. This lesson walks through Spark, the ecosystem around it, the engines that compete with it for different niches, and the decision tree of when to reach for which.

The framing for the lesson is comparative. Spark is one of several batch tools that a 2026 data engineer might reach for, and the interesting question is rarely “is Spark good?” but “is Spark the right tool for this specific shape of work?”. The answer depends on the size of the data, the latency budget, the team’s existing platform, and whether SQL on a managed warehouse already covers the use case.

Why Spark won

Matei Zaharia and the AMPLab group introduced Spark to fix the specific weaknesses of MapReduce while keeping the parts that worked. The original paper (“Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, NSDI 2012) is worth reading; the core ideas are short.

In-memory intermediate state. A Spark job represents the data as a series of immutable, partitioned datasets called RDDs (Resilient Distributed Datasets). Operations build a new RDD from an old one. Crucially, intermediate RDDs can be cached in memory across the cluster and reused by later stages without round-tripping through disk. For an iterative job like training a logistic regression on the same training data for fifty epochs, MapReduce paid the disk cost fifty times; Spark pays it once. The reported speedups in the 2012 paper were ten to one hundred times for iterative workloads, and they held up in production.

DAG-aware execution. MapReduce ran one map-reduce pair at a time, materialising output to HDFS between jobs. A query that needed three map-reduce passes was three separate jobs with three writes to HDFS. Spark builds a directed acyclic graph of all the operations in a single job, optimises across the whole graph, and only materialises the final result. The cost model picks better shuffle boundaries, and the runtime can pipeline operations that do not require a shuffle between them.

Better APIs. The first Spark API was a Scala-native functional one, which was already a step up from raw Hadoop. The second was DataFrames (2015), modelled on R and pandas, with a query optimiser (Catalyst) underneath that compiled DataFrame operations into efficient execution plans. The third was Spark SQL, which let you write a select against a DataFrame as if it were a database table. By 2020 most Spark code was DataFrames or SQL, with raw RDDs reserved for the rare cases where the higher-level APIs did not fit. PySpark made the same APIs available from Python, which is what most data engineers actually wrote.

Polyglot from day one. Scala (the native language), Java, Python (PySpark), and R all had first-class APIs. SQL came later but became dominant. The same engine ran all of them, which meant a team of mixed Scala-and-Python engineers could share a codebase without the language being an integration boundary.

The combination of speed, expressiveness, and language coverage was enough to displace MapReduce as the default within a few years. Hadoop clusters did not disappear overnight; they got Spark installed alongside MapReduce, then Spark slowly took over as the engine, then HDFS got replaced by S3, and one day there was no Hadoop left in the building.

The Spark ecosystem in 2026

Spark in 2026 is a stack of components on top of the same execution engine.

Spark SQL and DataFrames. The dominant API. Almost all new batch code is written here. The Catalyst optimiser turns logical plans into physical plans, and the Tungsten execution engine generates efficient bytecode for the operators. For most teams, “writing a Spark job” means writing SQL or chained DataFrame operations and trusting the optimiser.

Structured Streaming. The streaming API of Spark, sharing the same DataFrame model and engine. The pitch is “the same code that processes a batch of yesterday’s data can process today’s stream”, and Spark gets remarkably close to delivering that. Module 6 covers streaming in detail; for now the relevant fact is that Spark is one of the engines competing for that workload.

MLlib. The classical machine-learning library on top of Spark: distributed implementations of regression, clustering, recommendation, gradient-boosted trees. MLlib mattered when training data did not fit on one machine and deep learning was not the dominant approach. In 2026 most ML training has moved to dedicated frameworks (PyTorch, JAX, scikit-learn on a single big machine, or specialised platforms). MLlib still ships and still works; it is rarely the first choice.

GraphX. The graph-processing API. It exists. Almost nobody uses it. Graph workloads in 2026 mostly run on dedicated engines (Neo4j, JanusGraph, NebulaGraph, the graph extensions inside Postgres or DuckDB) or on Spark with a graph library bolted on. GraphX is a footnote.

The point of listing the components is that “Spark” in marketing materials covers a wide footprint, and the only piece most teams actually use heavily is Spark SQL plus DataFrames. The rest is available, and occasionally relevant.

Databricks

Spark stayed an Apache project, but the company spun out of the AMPLab to commercialise it (Databricks, founded 2013) became dominant in a way that no Apache-project commercial backer has matched since Cloudera’s early years. Databricks runs the largest managed-Spark service, sells a notebook-and-cluster product that has effectively become the data-platform-of-record for many enterprise data teams, and has extended the platform with proprietary additions that the open-source Spark project does not have. Delta Lake (lesson 37) is the most consequential of these, and Unity Catalog (governance), Photon (a C++ vectorised query engine for Spark SQL), and the AI/BI products are all Databricks-only.

In 2026, Databricks has gone public (IPO completed late 2024), is the second-largest data-platform vendor by revenue after Snowflake, and is the default answer for “we want a Spark-based data platform without operating Spark ourselves” in the enterprise. The competing managed Sparks (AWS EMR, Google Dataproc, Azure Synapse Spark) are real but have not displaced Databricks for shops that have committed to a Spark-centric stack.

The competitors in 2026

Spark is not the only batch engine, and it is increasingly not the right one for several common shapes of work. The competitive landscape is the actually interesting part of the lesson.

DuckDB. An in-process, columnar, single-machine analytical database, written in C++. DuckDB’s pitch is that an enormous fraction of “I need to crunch some Parquet files” workloads do not need a cluster: a single modern server with a hundred gigabytes of RAM can chew through hundreds of gigabytes of Parquet on disk in seconds, with no cluster, no scheduler, no operational footprint. DuckDB runs as a library inside Python, R, Node, or a CLI; you import it, point it at S3, and write SQL. For datasets up to about a terabyte, DuckDB is often faster than Spark on the same query, because there is no shuffle, no JVM, no driver-executor coordination, just one process reading columnar data with a vectorised execution engine. DuckDB has eaten a meaningful fraction of the “small-to-medium analytics” workload that used to run on Spark by default.

Polars. A DataFrame library written in Rust, with a Pandas-like API. Same niche as DuckDB but with a DataFrame surface rather than a SQL one, and a similar performance profile: vectorised, columnar, single-machine, fast. Polars and DuckDB have become close substitutes for many use cases, and the choice between them is often driven by whether the team prefers SQL or DataFrames as the surface area.

BigQuery, Snowflake, Redshift, Databricks SQL. The hosted warehouses. As lesson 33 covered, the ELT pattern is to load raw data into the warehouse and transform it with SQL. For workloads that fit in this pattern (most BI and analytics), the warehouse engine handles the batch processing internally, and Spark is not in the picture at all. The user writes a select and the warehouse engine (Dremel inside BigQuery, the proprietary engines inside Snowflake and Redshift, Photon inside Databricks SQL) does the distributed batch work invisibly.

Trino and Presto. Distributed SQL query engines, originally developed at Facebook (Presto, 2013) and forked into Trino (2020). Both are designed to query data sitting in object storage and warehouses without copying it. Trino is the dominant open-source choice for “SQL across many sources”, with Starburst as the commercial backer. It overlaps with Spark SQL for ad-hoc analytical queries and with the warehouses for federated reads. It is rarely the first choice for heavy ETL but often the right answer for interactive analyst queries across a lake.

Dask and Ray. Python-native distributed computing libraries. Dask is closest to Spark in spirit: a DataFrame API and a task scheduler, both in Python, that scale up to clusters. Ray is more general (a distributed-task framework) and has become the default backend for distributed ML training and serving. Both have a niche where Python is the primary language and the workload does not fit Spark or warehouse SQL well, but neither has displaced Spark for general-purpose batch.

When to reach for Spark in 2026

A short decision tree.

flowchart TD
    Q[Batch work to do] --> SizeQ{Data size}
    SizeQ -->|"Fits in RAM<br/>on one machine"| Single[DuckDB or Polars]
    SizeQ -->|"100GB to a few TB"| MidQ{Already in a<br/>warehouse?}
    SizeQ -->|"Many TB to PB"| Spark[Spark or Databricks]
    MidQ -->|Yes| WH[Warehouse SQL<br/>BigQuery, Snowflake, Redshift]
    MidQ -->|No, on object storage| LakeQ{Need streaming<br/>or complex ETL?}
    LakeQ -->|No, just SQL| Trino[Trino or DuckDB on S3]
    LakeQ -->|Yes| Spark
    Spark --> StreamQ{Sub-second<br/>latency?}
    StreamQ -->|Yes| Flink[Flink, see Module 6]
    StreamQ -->|No| SparkOK[Spark Structured Streaming<br/>or batch]

Spark earns its place when one or more of the following is true.

Data is genuinely big. Terabytes to petabytes per job. Single-machine engines start to thrash, warehouse credits get expensive, and the cluster-scale shuffle that Spark does well becomes the bottleneck. This is the workload Spark was designed for and the one where it is still the default.

The job mixes batch and streaming on the same data. Structured Streaming lets the same code path serve both modes, with a meaningful efficiency win over running two separate stacks. If your platform needs both and you do not want to operate Flink for the streaming side, Spark is a good answer.

You are already on Databricks or have a Spark cluster. Inertia is a legitimate tiebreaker. If the platform is in place, the team knows it, and the workload fits, “use what you have” is the correct decision.

The transformation involves things SQL does not express well. Heavy use of UDFs, integration with Python ML libraries, complex graph or array operations: Spark’s DataFrame API is more expressive than warehouse SQL for these cases.

When NOT to reach for Spark.

The data fits on one machine. A single 64-core, 512 GB server processes hundreds of gigabytes of Parquet faster than a Spark cluster does, with one binary and one operator (you). DuckDB or Polars is the right answer here. The threshold has been moving up every year as RAM gets cheaper and DuckDB gets faster, and a lot of “we needed Spark” workloads from 2018 are single-machine workloads now.

The work can be expressed as SQL on a managed warehouse. If the data is already in Snowflake or BigQuery and the transformation is SQL, the warehouse engine is faster, cheaper, and dramatically simpler to operate than spinning up a Spark job to read out, transform, and write back. Lesson 33 covered why this is the dominant pattern.

Real-time, sub-second latency is required. Spark Structured Streaming has improved its low-latency performance substantially since the original micro-batch model, but Flink is still the dominant choice for genuinely event-by-event processing with millisecond budgets. Module 6 walks through that decision in detail.

Cross-link to PySpark

If you are choosing Spark for your platform and want to learn the engine, this site has a dedicated PySpark course at /programming/courses/pyspark/ that goes deeper into the actual job-writing experience: cluster sizing, partition tuning, shuffle behaviour, the join strategies the optimiser picks, and the ways production Spark jobs go wrong. This lesson covers the architectural decision; the PySpark course covers the day-to-day craft.

Closing the lesson

Spark sits in the middle of the modern batch landscape. It is no longer the default for small workloads (DuckDB), and it is no longer the default for warehouse-shaped workloads (Snowflake or BigQuery), but it is still the default for genuinely big batch and for unified batch-and-streaming on the same engine. The framing the rest of Module 5 will use is that Spark is one of several specialised engines, each earning its place against the alternatives. Lesson 36 turns to the question of how data is laid out on the storage underneath all these engines: the file formats, the table formats, and the open-table-format wars that produced Delta, Iceberg, and Hudi.

Citations and further reading

Matei Zaharia et al., “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”, NSDI 2012, https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia (retrieved 2026-05-01).
Apache Spark documentation, https://spark.apache.org/docs/latest/ (retrieved 2026-05-01).
Databricks documentation, https://docs.databricks.com/ (retrieved 2026-05-01).
DuckDB documentation, https://duckdb.org/docs/ (retrieved 2026-05-01). Worth reading the “Why DuckDB” page for the design choices.
Polars user guide, https://docs.pola.rs/ (retrieved 2026-05-01).
Trino documentation, https://trino.io/docs/current/ (retrieved 2026-05-01).
“Spark: The Definitive Guide” (Bill Chambers and Matei Zaharia, O’Reilly, 2018). Slightly dated but still the clearest end-to-end Spark reference.