Tungsten: code generation and the columnar memory layout

Catalyst rewrites your query into the best plan it can come up with. That’s half the story. The other half is how Spark runs that plan, and the answer changed dramatically in 2015 when Project Tungsten landed. RDD Spark and DataFrame Spark are technically the same engine, but in practice they perform like different products, and Tungsten is most of the reason. This lesson is about what Tungsten actually does, why DataFrame code so often beats RDD code by 5–10x, and the small set of plan markers you should learn to recognize.

What Tungsten set out to fix

By 2015, Spark had a problem that lots of in-memory engines had: it was bottlenecked on the JVM. CPUs had gotten dramatically faster, RAM bandwidth had improved, but JVM-based code wasn’t taking advantage. The bottlenecks were specific:

Memory overhead. A Long value in a JVM Object[] row takes 16 bytes for the boxed Long plus 8 bytes for the pointer, plus padding — call it 32 bytes — to store 8 bytes of actual data. A row of ten longs took up 320 bytes when the data was 80. For a system that’s supposed to be in-memory, this was catastrophic.

Garbage collection. Spark’s RDD execution allocated millions of small Java objects per task. The GC pause times that resulted dominated executor time on big workloads. Tuning JVM GC was a permanent part of running Spark in production, and it never really worked well.

Cache-unfriendly access patterns. Java objects live wherever the heap allocator put them, which means iterating over a “row” is a sequence of pointer chases, each one a likely L2 or L3 cache miss. CPUs spent most of their time waiting on memory.

Virtual function call overhead between operators. Each operator in the physical plan was a separate object with an iterator.next() method. To produce one row of the final result, the engine called next() on the top operator, which called next() on its child, and so on down the tree. Every call was a virtual dispatch the JIT couldn’t always inline. For a million rows through a five-operator stage, that’s five million virtual calls.

Tungsten attacked all four. The result is the engine you’ve been using.

Piece 1: managed off-heap memory and UnsafeRow

Tungsten introduced a binary row format called UnsafeRow. Instead of Object[], a row is laid out as a fixed-size header (null bitmap, field offsets) followed by tightly packed binary values. Fixed-width fields like int and double go inline; variable-length fields like strings store their data in a contiguous tail. A row of ten longs takes 88 bytes — the 80 bytes of data plus a small header — instead of 320.

This format lives in memory that Spark manages itself, often off-heap (configured via spark.memory.offHeap.enabled and spark.memory.offHeap.size), allocated in big slabs that Spark divides up. Garbage collection doesn’t touch it. There’s no boxing — a column of int is genuinely four bytes per value.

You don’t interact with UnsafeRow directly in Python. You see its effects: dramatically lower memory pressure per executor, no GC tuning headaches, and the ability to keep hundreds of millions of rows in memory on a single machine without falling over.

Piece 2: cache-conscious computation

Once data is in this dense binary form, you can lay it out for the CPU cache. Tungsten places rows of a partition in contiguous memory regions. Iterating a partition becomes a sequential walk over a buffer, which the CPU prefetcher loves. Compare to RDD Spark, where iterating a partition meant walking a list of Java objects scattered across the heap.

For columnar reads (Parquet, ORC) the win is even bigger. Instead of materializing rows immediately, Spark reads column chunks into Tungsten-managed columnar batches — typically 4,096 values from a single column, packed together. Operations that touch one column at a time — filters, projections, aggregations — read a tight loop over one buffer, stay in L2 cache, and let the CPU’s SIMD units actually do work. A vectorized Parquet reader can be 5–10x faster than a row-at-a-time reader on the same file.

You see this in plans as ColumnarToRow nodes — they mark the boundary where Spark transitions from columnar batch processing (used in scans and some operators) to row-at-a-time processing (used in others). Modern Spark does an increasing amount of work in pure columnar mode and ColumnarToRow shows up later in the plan than it used to.

Piece 3: whole-stage code generation

This is the headliner. Instead of executing the physical plan as a tree of operator objects, Tungsten generates Java bytecode at runtime that fuses adjacent operators into a single tight loop, then JIT-compiles it.

A stage that scans, filters, projects, and aggregates in the old model is four operators with four iterators and four virtual next() calls per row. Whole-stage codegen rewrites that into roughly:

while (input.hasNext()) {
    UnsafeRow row = input.next();
    // Inlined filter
    if (row.getDouble(2) <= 40.0) continue;
    // Inlined projection
    long userId = row.getLong(1);
    double amount = row.getDouble(2);
    // Inlined partial aggregation
    aggBuffer.update(...);
}

One method, one loop, no virtual dispatch, JIT-friendly. The JIT can hoist null checks, vectorize parts, and inline aggressively. The result is often within 2–3x of hand-written tight C, on Java.

The visible signature in the physical plan is the asterisk:

*(2) HashAggregate(keys=[country#X], functions=[partial_sum(amount#Y)])
+- *(2) Project [country#X, amount#Y]
   +- *(2) Filter (amount#Y > 40)
      +- *(2) Scan ExistingRDD[user_id#A, amount#Y, country#X]

Every operator with *(2) is part of stage 2’s generated code. The number is the codegen stage id. When you see a contiguous subtree of *(N) operators with the same N, that’s one fused loop.

When the asterisk is missing, codegen didn’t kick in. The most common reasons:

A regular Python UDF in the plan. The BatchEvalPython node and everything around it can’t be fused with codegen — Spark has to break the stage to ship rows to the Python worker. Lesson 40 said UDFs were expensive; this is the deepest reason why.
Operators that don’t support codegen. A handful of physical operators (some join variants, some aggregation paths, some window functions) don’t have codegen implementations. Spark falls back to the old volcano-style iteration.
Generated code exceeded the JVM bytecode size limit. The JVM caps a single method at 64KB of bytecode. For very wide schemas (hundreds of columns) or very large generated expressions, Spark’s codegen can blow past this. When that happens, Spark catches the error and falls back to the interpreted path. You’ll see this in executor logs as a warning about JaninoRuntimeException or “method too large.” If you see it repeatedly, splitting your query into smaller stages or reducing column count is the fix.

There’s a debugging knob:

spark.conf.set("spark.sql.codegen.wholeStage", "false")

This disables whole-stage codegen for the session. Almost nobody should ever set this. The one case it’s useful is debugging a suspected codegen bug: turn it off, see if your weird behavior goes away, file a Spark bug report. In normal operation it should be on, and on recent Spark versions it’s on by default.

Vectorized readers

Closely tied to Tungsten is the vectorized columnar reader for Parquet and ORC. Instead of decoding one row at a time, the reader pulls a batch of values from a column, applies decoding (RLE, dictionary) over the whole batch, and hands a columnar buffer to the next operator. Filters that match the columnar form run vectorized too.

The relevant configs:

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")  # default true
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")      # default true

Both default to true. The reason to know they exist is debugging: if you see a Scan parquet node feeding a ColumnarToRow very early in the plan, vectorized reading is on. If you see plain row-at-a-time reads, something’s disabled — usually because of an unsupported type (some complex nested types) or a config override.

A concrete benchmark

Take a simple aggregation and compare RDD vs DataFrame on the same data. The numbers are illustrative, not benchmarks you should quote, but the shape is what every team’s seen the first time they try this:

import time
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("TungstenDemo").master("local[*]").getOrCreate()

# 50M rows, two columns
df = spark.range(0, 50_000_000).select(
    (F.col("id") % 1000).alias("group"),
    (F.rand() * 100).alias("value"),
)

# Materialize once so the cost is in the aggregation, not the generation
df.write.mode("overwrite").parquet("/tmp/tungsten_demo")
df = spark.read.parquet("/tmp/tungsten_demo")

# DataFrame path — codegen, vectorized read, columnar batches
t0 = time.time()
df.groupBy("group").agg(F.sum("value")).collect()
print("DataFrame:", time.time() - t0, "seconds")

# RDD path — no codegen, no UnsafeRow benefits, Python objects
t0 = time.time()
(df.rdd
   .map(lambda r: (r["group"], r["value"]))
   .reduceByKey(lambda a, b: a + b)
   .collect())
print("RDD:", time.time() - t0, "seconds")

On my laptop the DataFrame version runs in about 3 seconds and the RDD version in about 30. Almost the entire difference is Tungsten: codegen fusing the filter/project/partial-agg, the vectorized Parquet reader, the columnar batches, the lack of Python-object overhead in the executor JVM. The query plan tells the same story:

df.groupBy("group").agg(F.sum("value")).explain()
# == Physical Plan ==
# AdaptiveSparkPlan isFinalPlan=false
# +- HashAggregate(keys=[group#X], functions=[sum(value#Y)])
#    +- Exchange hashpartitioning(group#X, 200), ENSURE_REQUIREMENTS
#       +- *(1) HashAggregate(keys=[group#X], functions=[partial_sum(value#Y)])
#          +- *(1) ColumnarToRow
#             +- FileScan parquet [group#X, value#Y]
#                Batched: true,
#                PushedFilters: [],
#                ReadSchema: struct<group:bigint,value:double>

The *(1) covers the entire pre-shuffle work: scan, columnar-to-row, partial aggregation, all fused. Batched: true confirms vectorized reading. The shuffle is the only thing not in codegen, and that’s because crossing executors is fundamentally not a per-row code-fusion problem.

The RDD version has none of this. Each row crosses into Python (because the lambda is Python) and back, the partial reduction runs in Python, and the optimizer doesn’t see anything it can rewrite. It’s a pure pipe of operations, no fusion, no batching, no codegen.

This is why almost every “should I use RDDs?” question has the same answer: no. The DataFrame engine’s performance comes from Tungsten and Catalyst working together, and you forfeit both the moment you drop to RDD.

What to remember

Tungsten is three things — managed off-heap memory with the UnsafeRow binary format, cache-conscious columnar layouts, and whole-stage codegen — and they explain most of why DataFrame Spark is fast. Read query plans for *(N) prefixes to confirm codegen is engaging; missing asterisks are the loud warning sign that something (usually a UDF) is breaking fusion. Vectorized Parquet/ORC reads are the gateway drug: the column pruning and predicate pushdown you get from Catalyst feed directly into batched columnar reads, which feed directly into codegen-fused execution. The whole stack is designed to keep dense data flowing through tight loops on modern CPUs.

That closes Module 7. You can now read a Spark plan, predict what the optimizer will do with your code, and recognize when the engine’s superpowers are or aren’t engaging. Module 8 starts next lesson and turns outward: data sources. Parquet, ORC, Avro, JDBC, the Hive metastore, Delta Lake, Iceberg — what they are, how they differ, and which one fits your workload.

References: “Project Tungsten: Bringing Spark Closer to Bare Metal” (https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html) and the Apache Spark SQL performance tuning guide (https://spark.apache.org/docs/latest/sql-performance-tuning.html). Retrieved 2026-05-01.