PySpark, from the ground up Lesson 55 / 60

The Spark UI: the most important tool you'll learn

A guided tour of every tab — Jobs, Stages, Tasks, SQL, Storage, Executors — and what each one tells you when something is wrong.

If you only learn one operational tool from this entire course, make it the Spark UI. Every performance question — “why is this job slow,” “why did it OOM,” “is my cache working,” “is my filter pushing down,” “is one task doing all the work” — has its answer in the UI. Logs tell you what happened. The UI tells you why it was slow, and that’s the question that pays the bills.

Module 10 is the production half of the course. We’re going to spend it on debugging, tuning, and the toolbox a Spark engineer reaches for at 2am. This lesson is the centerpiece: the UI, every tab, what columns to read first, what the patterns look like.

Where to find it

For a local SparkSession, the UI binds to localhost:4040 by default. If port 4040 is taken (because you have another driver running), Spark increments — 4041, 4042, and so on. The driver logs the actual URL on startup. You can also ask the session directly:

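from pyspark.sql import SparkSession

# Start (or reuse) a local session; the UI comes up with the driver.
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.sparkContext.uiWebUrl)
# e.g. http://192.168.1.10:4040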

On real clusters the cluster manager hosts (or links to) the UI:

  • Databricks — every cluster page has a “Spark UI” link. While the cluster runs, the UI is live; for terminated clusters Databricks keeps a static event-log render.
  • EMR — the YARN ResourceManager UI on the master node lists every application; click the application, click “ApplicationMaster” for the live UI, or “History” once it finishes.
  • Kubernetes / Spark Operator — the driver pod exposes 4040 internally; you kubectl port-forward it to your laptop.
  • Plain YARN — same as EMR, the ResourceManager links into it.
  • Spark History Server — for finished applications, a separate server reads the event logs (the spark.eventLog.dir you’ve been writing to) and re-renders the UI on demand.
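The History Server only has something to render if the application wrote an event log in the first place. Here is a minimal sketch of switching that on from the driver. The helper name with_event_log and the log path are made up for illustration; the two config keys are the standard Spark settings.

```python
def with_event_log(builder, log_dir):
    """Return the builder with event logging switched on.

    `builder` is expected to be a SparkSession.builder; `log_dir` is a
    hypothetical location that the History Server can also read.
    """
    return (
        builder
        .config("spark.eventLog.enabled", "true")  # event logging is off by default
        .config("spark.eventLog.dir", log_dir)     # where the event log files go
    )


# Usage (needs pyspark installed; the path is a placeholder):
#   from pyspark.sql import SparkSession
#   spark = with_event_log(SparkSession.builder, "file:///tmp/spark-events").getOrCreate()
```

The directory generally has to exist before the application starts, so create it as part of cluster setup rather than hoping Spark will.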

The UI is a read-only view of the Spark driver’s internal state. You can’t break anything by clicking around. Click everything.

Tabs in order of usefulness

Spark exposes a fixed set of tabs. I’m going to walk through them in the order I actually use them in production, not left-to-right.

1. Jobs — the high-level view

The landing page. One row per job, where a job is the work triggered by an action (.show(), .count(), .collect(), or a write such as .write.parquet(...)). Most actions map to one job, though some trigger several. Columns:

  • Job ID — sequential.
  • Description — the action name, or whatever you set with spark.sparkContext.setJobDescription("...") (do this; it’s free and makes the UI readable).
  • Submitted — wall-clock timestamp.
  • Duration — wall-clock duration.
  • Stages: Succeeded/Total — how many stages this job had.
  • Tasks (for all stages): Succeeded/Total — total task count and how many are done.

Use the Jobs tab to answer “which action is slow.” Sort by Duration. The slow job is the one to drill into.

If you see a job description like “count at NativeMethodAccessorImpl” with no other context, you’ve got a .count() somewhere you didn’t realize was triggering — often a debug print left in production.
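To keep the Jobs tab readable, it helps to wrap each action in a description. This is a small sketch built on setJobDescription; the context-manager name job_description is invented here, and clearing the label by passing None is an assumption you should verify on your Spark version.

```python
from contextlib import contextmanager


@contextmanager
def job_description(spark, text):
    """Label every job triggered inside the `with` block in the Spark UI."""
    sc = spark.sparkContext
    sc.setJobDescription(text)
    try:
        yield
    finally:
        # Assumption: passing None clears the label so later jobs
        # do not inherit a stale description.
        sc.setJobDescription(None)


# Usage (needs a live SparkSession named `spark` and a DataFrame `orders`):
#   with job_description(spark, "daily orders: row-count check"):
#       orders.count()
```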

2. Stages — the meat

Click a job and you get its stages. This is where you’ll spend most of your time.

A stage is a sequence of operations Spark can run without a shuffle. Every shuffle is a stage boundary. A typical shuffle join (a SortMergeJoin, for example) has three stages: scan and shuffle-write the left side, scan and shuffle-write the right side, then read both shuffles and do the join.

Per-stage columns to read first:

  • Duration — how long the stage took.
  • Tasks: Succeeded/Total — task count tells you partition count.
  • Input — bytes read from the source.
  • Output — bytes written.
  • Shuffle Read / Shuffle Write — bytes shuffled. Big shuffles are expensive shuffles.

Click a stage and you get the Task table. This is the single most useful screen in the entire UI.

For each task you see Status, Duration, GC Time, Shuffle Read Size, Shuffle Read Records, Spill (Memory), Spill (Disk), and Errors. Sort by Duration descending. Now look at:

Max vs median. Hit “Summary Metrics” at the top — Spark gives you min / 25% / median / 75% / max for every numeric column. If the max task duration is 10× the median, you have skew. If max input is 10× the median, you have skewed input partitions. If max shuffle read is 10× the median, you have a hot key in a join or groupBy. Skew is the single most common production problem; the task table is where you spot it.
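The max-vs-median check is simple arithmetic, so here it is as a sketch you can run on numbers copied out of the task table. The function name and the sample durations are made up for illustration.

```python
from statistics import median


def skew_ratio(values):
    """Max-to-median ratio for one task metric (duration, input bytes, ...)."""
    if not values:
        raise ValueError("need at least one task value")
    mid = median(values)
    if mid == 0:
        # All-zero metric: no skew. Zero median with a nonzero max: extreme skew.
        return float("inf") if max(values) > 0 else 1.0
    return max(values) / mid


# Made-up task durations in seconds: seven ordinary tasks and one straggler.
durations = [12, 11, 13, 12, 14, 12, 11, 130]
ratio = skew_ratio(durations)
print(f"max/median = {ratio:.1f}")  # max/median = 10.8
```

A ratio near 1 means the tasks are balanced; the 10× rule of thumb from above is where you start suspecting a hot key or a lopsided partition.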

Spill columns. Spill (Memory) is the in-memory (deserialized) size of the data Spark had to push out of execution memory. Spill (Disk) is the size of that same data as written to local disk, serialized and usually compressed, which is why it is the smaller number. Any spill at all means a task ran out of execution memory and fell back to disk. A little is normal under pressure; a lot is the reason your job is slow. We’ll talk about fixing it in lesson 57.

GC time. Time spent in JVM garbage collection per task. If GC is more than ~10% of task duration, you’re under memory pressure even if no spill is happening. Bump executor memory or reduce per-task data.

The DAG visualization at the top of the stage page is also handy — it draws the operators inside the stage and arrows between them. Useful for understanding what the stage is doing, less useful for performance debugging than the task table.

3. SQL / DataFrame — the query plan tab

This is the most-clicked tab in production for DataFrame and SQL workloads. Every query you run shows up as a row with the original SQL text or a generated description, plus an “Execution ID” link.

Click the Execution ID and you get the operator graph: every node in the physical plan as a box, every box annotated with runtime metrics such as output rows, time, and bytes. The annotations come from the actual run, not estimates; this is the post-mortem view of what Spark did.

What to look for:

  • Unexpected Exchange operators. Each Exchange is a shuffle. Two Exchanges where you expected one means you accidentally caused a re-shuffle (often by repartitioning twice or breaking a co-partitioning).
  • The join algorithm box. BroadcastHashJoin is fast. SortMergeJoin is the safe default. BroadcastNestedLoopJoin is a code smell; it usually means the optimizer couldn’t pick a strategy because of a non-equi join condition.
  • Row counts on filters. If your filter says “1B rows in, 1B rows out,” the filter isn’t filtering — usually because of a type mismatch.
  • Adaptive Query Execution annotations. With AQE on, the SQL tab is the only place to see what really ran, because the plan changed at runtime. Look for AQEShuffleRead boxes (older releases label the same operator CustomShuffleReader); that’s AQE coalescing or splitting partitions adaptively.
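Counting Exchange operators is easy to do mechanically once you have the plan as text, for example pasted from the details section of a query page or from explain() output. The sketch below assumes that text format; the function name is invented and the sample plan is hand-written and abbreviated, not output from a real run.

```python
import re

# Match the standalone "Exchange" operator only. BroadcastExchange and
# ReusedExchange are not counted, since neither is a new shuffle.
_SHUFFLE = re.compile(r"\bExchange\b")


def count_shuffles(plan_text):
    """Count shuffle Exchange operators in a physical plan string."""
    return len(_SHUFFLE.findall(plan_text))


# Hand-written, abbreviated plan text for illustration only.
plan = """\
SortMergeJoin [id#0L], [id#4L], Inner
:- Sort [id#0L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#0L, 200), ENSURE_REQUIREMENTS
:     +- FileScan parquet [id#0L]
+- Sort [id#4L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#4L, 200), ENSURE_REQUIREMENTS
      +- FileScan parquet [id#4L]
"""
print(count_shuffles(plan))  # 2
```

Two shuffles is what you expect for a sort-merge join. If the count comes back higher than your mental model of the query, that extra Exchange is the thing to chase.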

The SQL tab is also where you’ll discover that what df.explain() printed before the run isn’t what actually executed. We’ll dig into that in lesson 56.

4. Storage — the cache truth-teller

The Storage tab lists every cached or persisted DataFrame, with its actual on-cluster size, the storage level (memory only, memory and disk, etc.), the fraction cached, and the per-executor distribution.

What to read first:

  • Fraction cached. If it’s not 100%, you don’t have enough memory and Spark evicted partitions. The cache is partially cold and the next read is going to be partial recompute.
  • Size in Memory. This is what’s actually resident, which is often much bigger than the on-disk Parquet size because of the in-memory format.
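  • RDD names. A DataFrame cached with .cache() shows up labeled with its query plan, which makes several cached DataFrames hard to tell apart. One way to get human names is to register each as a temp view with df.createOrReplaceTempView("name") and then cache it by name with spark.catalog.cacheTable("name"), so the entry carries the table name in the UI.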

Storage is how you confirm that .cache() actually did what you expected. If the tab is empty, the cache was never materialized: caching is lazy, so .cache() by itself stores nothing, and you have to run an action that touches the cached DataFrame for it to populate.
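Because caching is lazy, a common pattern is to follow .cache() with a cheap full-scan action so the Storage tab fills up right away. A minimal sketch, with a made-up helper name; using .count() as the materializing action is one choice among several.

```python
def cache_and_materialize(df):
    """Mark a DataFrame for caching, then force the cache to populate."""
    cached = df.cache()  # lazy: only marks the plan as cacheable
    cached.count()       # action that scans every partition and fills the cache
    return cached


# Usage (needs a live SparkSession and a DataFrame `events`):
#   events = cache_and_materialize(events)
#   print(events.is_cached, events.storageLevel)
```

After this runs, the DataFrame should appear in the Storage tab, and Fraction Cached tells you whether all of it actually fit.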

5. Executors — per-executor health

One row per executor (plus a row for the driver). Columns:

  • Address — host:port.
  • Status — Active or Dead.
  • RDD Blocks / Storage Memory — how much cached data this executor holds.
  • Disk Used — shuffle and spill data on local disk.
  • Cores — cores assigned to this executor.
  • Active / Failed / Complete Tasks — task throughput.
  • Task Time (GC Time) — total task time and the GC fraction.
  • Input / Shuffle Read / Shuffle Write — bytes processed.

The big one: GC Time as a fraction of Task Time. If an executor is spending 30% of its time in GC, it’s drowning in objects and the JVM is the bottleneck. Either bump memory, reduce per-executor cores, or fix the workload. GC pressure is a silent killer — your job runs, but at half speed, and nothing logs an error.
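The GC check is just a ratio per executor. Here is a sketch that flags the heavy ones, assuming you have copied Task Time and GC Time out of the Executors tab into plain dicts. The key names, the sample numbers, and the 10% default threshold (borrowed from the per-task rule of thumb earlier) are all illustrative.

```python
def gc_heavy_executors(executors, threshold=0.10):
    """Return the ids of executors whose GC share of task time exceeds threshold."""
    heavy = []
    for ex in executors:
        task_ms = ex["task_time_ms"]
        if task_ms > 0 and ex["gc_time_ms"] / task_ms > threshold:
            heavy.append(ex["id"])
    return heavy


# Made-up numbers: executor 1 spends 4% of its time in GC, executor 2 spends 30%.
executors = [
    {"id": "1", "task_time_ms": 600_000, "gc_time_ms": 24_000},
    {"id": "2", "task_time_ms": 580_000, "gc_time_ms": 174_000},
]
print(gc_heavy_executors(executors))  # ['2']
```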

The other thing to check is Dead executors. If executors keep dying and being replaced, something is wrong — usually OOM kills from the cluster manager. Click the dead executor, look at the executor log, find the kill reason. Lesson 57 is the OOM postmortem lesson.

6. Streaming / Structured Streaming

Only present if you have streaming queries. Per-query stats: input rate, processing rate, batch duration, state operator size. The patterns to watch are processing rate falling behind input rate (you can’t keep up), batch duration creeping up over time (state is growing unbounded — watermark issue), and state operator size hitting tens of GB (you’re holding too much state, time to add a watermark or rethink the keys).
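The "falling behind" pattern can also be checked in code from a query's progress reports. This sketch assumes each report is dict-like and carries inputRowsPerSecond and processedRowsPerSecond, as the entries of a streaming query's recentProgress do in PySpark; the function name and the three-batch window are arbitrary choices.

```python
def falling_behind(recent_progress, min_batches=3):
    """True if the last `min_batches` reports all show input outpacing processing."""
    tail = recent_progress[-min_batches:]
    if len(tail) < min_batches:
        return False  # not enough history to call it a trend
    return all(
        (p.get("inputRowsPerSecond") or 0.0) > (p.get("processedRowsPerSecond") or 0.0)
        for p in tail
    )


# Usage (needs a running StreamingQuery named `query`):
#   if falling_behind(query.recentProgress):
#       print("processing rate is below input rate")
```

Requiring several consecutive batches avoids paging on one noisy micro-batch.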

We covered the streaming half in lessons 49-54; this tab is where you’ll diagnose the production version.

7. Environment

Every Spark config, JVM property, classpath entry, and Hadoop config in effect for this driver. Useful when you’re debugging “why does this behave differently in production” — half the time the answer is a config the platform team set that you didn’t know about. Search for spark.sql.adaptive, spark.serializer, spark.executor.memory to start.
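The same values can be pulled from a live session, which is handy for logging them at job start. A small sketch using spark.conf.get with a default; the helper name and the placeholder string are made up.

```python
def effective_conf(spark, keys, missing="<not set>"):
    """Look up each config key on the session, with a placeholder when unset."""
    return {key: spark.conf.get(key, missing) for key in keys}


# Usage (needs a live SparkSession named `spark`):
#   for key, value in effective_conf(spark, [
#       "spark.sql.adaptive.enabled",
#       "spark.serializer",
#       "spark.executor.memory",
#   ]).items():
#       print(f"{key} = {value}")
```

The Environment tab stays the fuller view, since it also covers JVM properties and the classpath.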

The “where do I look first” decision tree

You’ve been paged. The pipeline is slow. Walk through the UI like this:

  1. Jobs tab. Sort by duration. The slow job is the one to click.
  2. Click the slow job → Stages. Sort stages by duration. The slow stage is the one to click.
  3. Click the slow stage → Task table. Click “Summary Metrics.”
    • Max duration ≫ median? Skew. Fix with salting or AQE skew handling.
    • Big spill numbers? Memory pressure. Bump executor memory or reduce partition size.
    • High GC time? Memory pressure too, often more cores per executor than the heap can sustain.
    • Everything balanced and just slow? You’re CPU-bound — need more cores or a smarter query.
  4. Cross-check in the SQL tab. Find the query, look at the operator graph, check the join strategy and the row counts at each filter. You’ll often spot “the filter that did nothing” or “the broadcast that didn’t happen.”
  5. Executors tab. Any dead ones? Any one executor doing all the work? GC time looking healthy?
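The four checks in step 3 can be written down as a tiny triage function. It is a sketch only: you feed it numbers read off the stage page, and the thresholds are rules of thumb. The 10× skew factor and 10% GC share come from earlier in this lesson, while the 1 GiB spill limit is an arbitrary assumption.

```python
def diagnose(median_s, max_s, spill_bytes, gc_s, task_s,
             skew_factor=10.0, spill_limit=1 << 30, gc_share=0.10):
    """Return the problems suggested by one stage's task metrics."""
    findings = []
    if median_s > 0 and max_s / median_s >= skew_factor:
        findings.append("skew")    # max duration far above median
    if spill_bytes >= spill_limit:
        findings.append("spill")   # memory pressure showing up as spill
    if task_s > 0 and gc_s / task_s > gc_share:
        findings.append("gc")      # memory pressure showing up as GC
    # Nothing flagged: balanced but slow, so likely CPU-bound.
    return findings or ["cpu-bound"]


# Made-up stage: median task 12 s, slowest task 130 s, no spill, 3 s GC in 200 s.
print(diagnose(median_s=12, max_s=130, spill_bytes=0, gc_s=3, task_s=200))
# ['skew']
```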

That’s 90% of Spark debugging. The UI tells you which of those four things is wrong. Reading the UI fluently is the difference between a Spark engineer and someone who runs Spark.

Next lesson: .explain() and how to read the plan before you run it, so the UI confirms what you expected instead of being a surprise.
