PySpark vs Scala Spark: what crosses the wire

There’s a piece of Spark folklore that goes “Scala is faster than PySpark.” Like most pieces of folklore, it’s about 30% true and 70% confidently misremembered from 2016. The honest answer in 2026 is more like: PySpark and Scala Spark are equally fast for the work most people actually do, and PySpark gets slow only in specific, well-understood, increasingly avoidable situations.

To know which is which, you have to understand how PySpark talks to the JVM. The architecture is genuinely simple once you see it, and once you see it the performance rules write themselves.

Spark is a Scala project

Spark is written in Scala. It runs on the JVM. The Catalyst optimizer is JVM code. The Tungsten execution engine — the thing that does whole-stage code generation, off-heap memory management, vectorised operations — is JVM code. Every executor is a JVM process. Even the file readers (Parquet, ORC, Avro) are JVM libraries.

There is no Python anywhere in this stack. Spark could not care less about Python.

So how does PySpark exist?

Py4J and the bridge

PySpark is a thin Python layer that sits on top of the Scala API and talks to it through a bridge called Py4J. Py4J is a library — older than Spark, originally a generic Java/Python bridge — that lets a Python process call methods on JVM objects over a local socket.

When your PySpark code does this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("runehold").getOrCreate()

df = spark.read.parquet("s3://runehold/orders/")

result = (
    df.filter(F.col("country") == "IT")
      .groupBy("product_id")
      .agg(F.sum("amount").alias("total"))
)

result.write.parquet("s3://runehold/outputs/it_totals/")

…here’s what’s actually happening under the hood:

SparkSession.builder...getOrCreate() starts a JVM (if there isn’t one yet) and creates a Java SparkSession object inside it. Py4J returns a Python proxy object that holds a reference to it.
spark.read.parquet(...) is a Python method on that proxy that, internally, calls the Java DataFrameReader.parquet(...) method. The result is a Java Dataset<Row> in the JVM. Py4J returns another Python proxy.
df.filter(F.col("country") == "IT") constructs a Java Column expression — a tree of JVM objects representing the comparison — and calls Java Dataset.filter(Column). Another proxy.
groupBy, agg, sum, alias — all the same. Each one builds a JVM expression tree and returns a JVM Dataset proxy.
result.write.parquet(...) triggers an action. The driver hands the (entirely JVM-side) DAG to Catalyst, the plan is optimised, executors run it, and Parquet files appear in S3.

Notice what didn’t happen: data did not flow through Python. None of the rows in the Parquet files ever touched a Python process. The driver Python process built up a description of what to do, the description was translated to JVM expressions over Py4J, and the JVM did all the actual work.

This is why DataFrame PySpark and DataFrame Scala Spark are essentially the same speed. Both are programming the same JVM engine. The Python side is just sending instructions.

The Py4J overhead is real but tiny — a few microseconds per API call to send the message and receive the proxy. Compared to a job that processes billions of rows on dozens of executors, the Py4J chatter is rounding error.

Where Python actually pays a tax

Now for the situations where PySpark is genuinely slower than Scala. There are three of them, all related, and all involving data that has to leave the JVM and enter a Python process.

Python UDFs

A Python UDF is a regular Python function you register with Spark and call inside a DataFrame operation:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def normalise_country(s):
    if s is None:
        return None
    return s.strip().upper()

df.withColumn("country_norm", normalise_country(F.col("country")))

This looks innocent. It is not.

Here’s what actually happens at runtime, for every row, on every executor:

The executor JVM has the row in JVM memory.
To run your Python function, the JVM needs Python. Spark spawns a Python worker process alongside the executor (one per task running in parallel).
For each row, the JVM serialises the column value (using Pickle), sends it over a local socket to the Python worker, the Python worker deserialises it, runs your function, serialises the result, sends it back, and the JVM deserialises it.
Repeat for every row. A million rows = a million round trips of serialise/socket/deserialise.

This is the famous JVM-to-Python serialisation tax, and on a non-trivial dataset it’s brutal. A job that takes 30 seconds with native functions can take 30 minutes with a Python UDF doing the equivalent transformation. I have personally seen 50x slowdowns. Production teams have horror stories.

The fix in 90% of cases is to not write a Python UDF in the first place. Spark’s pyspark.sql.functions module has hundreds of built-in operations — upper, trim, regexp_replace, when, coalesce, date_format, concat_ws, from_json, you name it. They run in the JVM. They’re fully visible to Catalyst. They are free.

The example above could be rewritten as:

df.withColumn("country_norm", F.upper(F.trim(F.col("country"))))

Same result, hundreds of times faster. Always check pyspark.sql.functions before reaching for @udf.

pandas_udf (Arrow-based)

Sometimes you genuinely need Python — there’s a pandas-only library, a sklearn model you want to score with, a piece of logic that’s too gnarly to express in native functions. For those cases, PySpark has pandas_udf, also known as a Vectorised UDF.

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
import pandas as pd

@pandas_udf(DoubleType())
def score_orders(amounts: pd.Series) -> pd.Series:
    return amounts * 1.22 + 5.0  # whatever your real logic is

df.withColumn("scored", score_orders(F.col("amount")))

Two things are different. First, your function takes and returns a pandas.Series (or DataFrame), not single values. Second, the bridge between the JVM and Python uses Apache Arrow — a columnar in-memory format that both sides can read directly without per-row serialisation.

The performance difference is enormous. A pandas_udf processing a million rows runs roughly 10x to 100x faster than the equivalent plain @udf. The cost is mostly the cost of running Python, not the cost of the bridge.

The rule: if you must run Python on rows, use pandas_udf first. Plain @udf is the fallback when you can’t. Most modern PySpark codebases never use plain UDFs at all.

RDD operations in PySpark

The third place Python pays a tax is when you use the RDD API in PySpark. Every RDD operation is a Python lambda that has to actually execute in a Python process, with the same serialise-over-socket dance as a plain UDF. This is why the previous lesson said “use DataFrames” three times.

PySpark RDDs are roughly an order of magnitude slower than Scala RDDs for the same work, because the serialisation cost is per row and unavoidable. If you’ve decided to use RDDs and you need performance, that’s a real reason to consider Scala. Outside that case, the Python penalty is negligible.

Anecdote: which language do real teams pick?

In every data engineering team I’ve worked with or talked to in the last five years, the language choice has gone the same way:

The team already knows Python. Pandas, NumPy, scikit-learn, requests, the whole ecosystem.
The data scientists already know Python. They want to share notebooks with the engineering team.
The platform team wants one language across analytics, ML, and pipelines.
PySpark is fast enough.

So the team picks PySpark. There are no Scala compilation errors. There’s no sbt. There are no case classes. New hires read the codebase on day one. The data scientists contribute features without learning a new language.

When does Scala win? Almost exclusively in three cases:

Legacy. A 2015-era codebase written against the Scala API. Rewriting it isn’t worth it.
Library development. If you’re writing a library that ships across many Spark applications — a custom data source, a custom optimizer rule, a Catalyst extension — Scala is the native API and you’ll have a better time.
Heavy custom UDFs that can’t fit pandas_udf. A small minority of jobs. If your hot path is a row-by-row transformation that genuinely needs JVM speed and can’t be expressed in native functions or Arrow-based pandas_udf, you write a Scala UDF. The Scala UDF is a JAR you register from PySpark, and your Python code calls it. You don’t have to write the whole pipeline in Scala — just the hot kernel.

I have not seen a greenfield analytics project pick Scala Spark over PySpark since around 2020. Databricks’ own examples are mostly PySpark. Most managed Spark platforms default to PySpark. The Scala camp is real, professional, and shrinking.

A practical rulebook for PySpark performance

Pin these to the wall:

Read columnar formats (Parquet, Delta, Iceberg). Spark pushes filters and column pruning into the file scan; the JVM never reads what you don’t need.
Stay in the DataFrame API. Don’t drop to RDDs unless you can articulate why.
Use pyspark.sql.functions for transformations. There are hundreds. Read the module page once a quarter.
No plain @udf. If you reach for it, stop and check whether pyspark.sql.functions has the equivalent. Almost always it does.
If you need Python on rows, use pandas_udf. Arrow-based, vectorised, 10–100x faster than plain UDF.
Profile the slow part with the Spark UI. If you’ve followed 1–5 and a job is still slow, the bottleneck is data layout (skew, partitioning, file size) and not Python.
Reach for Scala only when the cost-benefit is clear. A custom Catalyst rule, a hot Scala UDF you call from Python, or a legacy migration. Not “because Scala is faster.”

The headline: PySpark in 2026 is the right default for almost everyone. The performance gap that used to exist has been largely closed by Catalyst, Tungsten, and Arrow-based UDFs. The Python ecosystem advantage has only grown. The team productivity argument is overwhelming.

End of Module 1. You now have the mental model: what Spark is, why it exists, the architecture (driver, executors, cluster manager), the three APIs (RDD, DataFrame, Dataset), and how PySpark relates to Scala underneath. From lesson 7 we install PySpark, set up a local environment, and start writing code.

References

Apache Spark — PySpark Overview: https://spark.apache.org/docs/latest/api/python/index.html
Apache Spark — Python User-Defined Functions: https://spark.apache.org/docs/latest/api/python/user_guide/sql/python_udf.html
Apache Spark — Pandas API on Spark and Arrow Optimization: https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html
Py4J — Python-to-Java bridge: https://www.py4j.org/
Apache Arrow — Columnar in-memory format: https://arrow.apache.org/

Retrieved 2026-05-01.