You can use Spark for years and never look at the architecture diagram. Most people do. They write df.groupBy(...).agg(...), click run, and walk away. That works until something breaks — a job hangs at 99%, an executor dies, the Spark UI shows red bars and shuffle reads in the gigabytes — and then “I don’t really know what’s running where” becomes a serious problem.
So before we touch a single line of PySpark, let’s build the mental model of what happens when you submit a job. By the end of this lesson you’ll be able to look at the Spark UI and read it like a novel: which JVM is doing what, where the parallelism is, and why the thing called “shuffle” is the villain of every performance story.
The cast of characters
A running Spark application has, at a minimum, three things:
- A driver — one JVM process that runs your code and coordinates everything.
- A cluster manager — the thing that hands out machines (or rather, slots on machines) to your application.
- One or more executors — JVM processes that actually do the work.
That’s it. No magic. No background daemon you forgot to install. Three components, three responsibilities. Once you internalise this, Spark stops being a black box.
Let’s walk through each.
The driver
The driver is the JVM that holds your SparkSession. When you run spark-submit my_job.py, or when you start a notebook and the first cell creates a session, the driver is the process that launches. It runs your Python (or Scala) code top to bottom, exactly the way a normal program would.
The driver is single-threaded for the linear flow of your code. Your for loops are sequential. Your if statements are sequential. Your print statements come out in order. The driver is just a regular program that happens to know how to talk to a cluster.
What the driver actually does:
- Holds the
SparkSessionandSparkContext— your entry points into the engine. - Builds the DAG (Directed Acyclic Graph) of transformations as you chain
.select(),.filter(),.join(),.groupBy(). Nothing executes yet — the driver is just collecting a recipe. - When you call an action (
.count(),.show(),.write.parquet(...),.collect()), the driver hands the DAG to the Catalyst optimizer, which rewrites it into an efficient physical plan. - The driver then breaks that plan into stages and tasks and ships those tasks out to the executors.
- It receives results back, monitors progress, restarts failed tasks, and ultimately returns the final answer to your code.
Two things to remember about the driver. First: when you call .collect(), every row of the result comes back to the driver’s JVM heap. If you .collect() a billion-row table, your driver will OOM and your job dies. Always know whether your action is “summary” (.count(), .show(20)) or “all of it” (.collect(), .toPandas()). Second: the driver is a single point of failure. If it dies, your application dies. Cluster managers can restart executors transparently; they cannot restart a driver mid-flight.
The executors
Executors are the JVMs that crunch the numbers. They run on worker nodes (machines in your cluster) and the driver gives them work. A typical Spark cluster has anywhere from 2 to 2,000 executors, depending on the job and the budget.
Each executor has two resources you care about:
- Memory. Used for caching DataFrames you’ve called
.cache()on, holding shuffle data, running tasks. You configure this withspark.executor.memory=4gor similar. - Cores. The number of tasks an executor can run in parallel. An executor with 4 cores runs 4 tasks at once. You configure this with
spark.executor.cores=4.
Multiply executors by cores and you get your total parallelism. A cluster with 10 executors and 4 cores each can run 40 tasks simultaneously. If your DataFrame has 200 partitions, those 200 tasks will be processed in roughly 5 waves of 40.
Executors live for the life of the application. They start when your SparkSession starts, they die when your application ends. (Dynamic allocation can add and remove executors during the job, but that’s an optimisation we’ll get to much later.) An executor that dies during a job — OOM, machine failure, network glitch — gets replaced by the cluster manager, and Spark re-runs whatever tasks were on it. This is the famous fault tolerance: lineage means Spark can always recompute what it lost.
A subtle but important point: in PySpark, each executor JVM may also spawn one or more Python worker processes if your job uses Python UDFs or RDD operations. The JVM and the Python workers communicate over a socket, and data has to be serialised between them. This is where PySpark earns its reputation for being “slower than Scala” — but only if you actually use Python UDFs. We’ll dig into that in lesson 6.
The cluster manager
The cluster manager is the layer that owns the machines and decides who gets to use them. Spark itself doesn’t run machines — it asks something else to give it some.
The four cluster managers Spark supports:
- Spark Standalone. Spark’s own bundled manager. Comes with the Spark distribution, easy to set up, fine for small dedicated clusters. You’ll see it in tutorials and lab environments.
- YARN. Hadoop’s resource manager. If your company has a Hadoop cluster — and a surprising number still do in 2026, though shrinking every year — Spark on YARN is the default. YARN was the dominant production deployment for years.
- Kubernetes. The modern cloud-native option. Spark on K8s has matured massively since 2020 and is now the standard for greenfield deployments on AWS EKS, GCP GKE, Azure AKS. Databricks, EMR, and most managed Spark platforms run on K8s under the hood. If you’re starting a new Spark deployment in 2026, this is almost certainly what you want.
- Mesos. Once a contender, now mostly dead. Apache retired Mesos in 2021. You’ll see it in legacy installations and nowhere else.
The cluster manager’s job is narrow: when your driver requests “give me 10 executors with 4 cores and 8 GB of RAM each,” the cluster manager finds machines that have those resources free and starts the executor JVMs there. Once they’re up, the driver talks directly to the executors. The cluster manager is mostly out of the picture during the actual work.
You don’t usually pick a cluster manager — your platform picks for you. Databricks picks Kubernetes (their own flavour). EMR lets you pick between YARN and Kubernetes. Glue picks something proprietary. Most of the time you write the same PySpark code regardless and the manager is invisible.
How a job actually runs
Now let’s trace what happens when you call an action. Say you run:
result = (
spark.read.parquet("s3://bucket/orders/")
.filter("order_status = 'shipped'")
.groupBy("country")
.agg({"amount": "sum"})
.collect()
)
Here is the sequence:
- The first three lines (
read,filter,groupBy,agg) build up a DAG in the driver. No data is read yet. .collect()is an action. The driver sends the DAG to Catalyst, the query optimizer. Catalyst rewrites it: it pushes the filter down into the Parquet read so we never read shipped-only rows; it picks an aggregation strategy; it produces a physical plan.- The physical plan is split into stages. A stage is a contiguous chunk of work where every operation can be done independently per partition with no data movement. The filter and the partial aggregation can happen in stage 1. The groupBy across all data requires data movement (a shuffle), so the final aggregation is in stage 2.
- The driver asks the cluster manager for executors (if it doesn’t already have them).
- Stage 1 launches as N tasks, where N is the number of input partitions. If the Parquet has 200 files, N is probably 200. Each task is one partition of work, sent to one core on one executor.
- As stage 1 tasks finish, they write their partial results to local disk on the executor — this is shuffle write.
- Stage 2 launches. Each task in stage 2 reads its input from the shuffle output of stage 1 across the network — shuffle read — and computes the final group sums.
- The final results are sent back to the driver, which assembles them into a Python list and returns them to your code.
Three vocabulary words from that sequence are the ones you’ll see in the Spark UI all day:
- Job. What an action triggers. One
.collect()call = one job. - Stage. A contiguous slice of the DAG with no shuffle in the middle. Most jobs have 2–10 stages.
- Task. One partition of one stage running on one executor core. The unit of parallelism. A typical job has thousands of tasks.
The “aha” moment in the Spark UI
The Spark UI (default port 4040 on the driver) is the single most useful diagnostic tool in the whole ecosystem. Once you understand the architecture, the UI tells you everything.
When you open the Executors tab, you’re looking at the JVMs doing the work. Memory used, cores active, tasks completed, GC time, shuffle bytes read and written. If one executor has done 10x the work of the others, you have skew and your job is suffering for it.
When you open the Stages tab, you’re looking at the slices of the DAG. The duration of each stage, the number of tasks, the input size, the shuffle read/write. A stage taking 10 minutes when the others took 30 seconds is the bottleneck — go look at it.
When you open the Tasks view inside a stage, you’re looking at the individual partition-level units. The min, median, and max task duration tell you whether work is balanced. If the max is 50x the median, you have a bad partitioning strategy and a few rows are doing the work of millions.
This is the whole loop of Spark performance tuning: read the UI, find the slow stage, find the skewed task, fix the data layout. Almost everything else is a footnote.
What we haven’t covered yet
A few things deliberately skipped, because they’re whole lessons of their own:
- Memory layout of an executor (storage memory, execution memory, the unified memory manager). Lesson 14.
- Shuffle internals — sort-shuffle, push-based shuffle, the role of the external shuffle service. Lesson 22.
- Adaptive Query Execution (AQE), which dynamically rewrites the plan mid-job based on real partition sizes. Lesson 24.
- Cluster autoscaling and dynamic allocation — Spark adding and removing executors as it discovers it needs them. Lesson 41.
For now, hold this picture in your head: one driver, many executors, a cluster manager handing out the machines, and a DAG of work flowing from your SparkSession down to the partitions on disk. Everything else in PySpark is detail.
Next lesson: the three APIs Spark gives you — RDD, DataFrame, Dataset — and why for ~99% of work you only ever touch one of them.
References
- Apache Spark — Cluster Mode Overview: https://spark.apache.org/docs/latest/cluster-overview.html
- Apache Spark — Submitting Applications: https://spark.apache.org/docs/latest/submitting-applications.html
- Apache Spark — Monitoring and Instrumentation (Spark UI): https://spark.apache.org/docs/latest/monitoring.html
Retrieved 2026-05-01.