PySpark SQL: when SQL beats DataFrame syntax

Module 7 starts here, and the theme is the optimizer — Catalyst. Before we open the hood on it, there’s a piece of the API most PySpark courses gloss over because it feels like a side road: the SQL interface. Every DataFrame can be turned into a SQL table with one line, queried with a string, and turned back into a DataFrame. It’s not a separate engine, it’s not slower, it’s not legacy. It’s the same Catalyst pipeline coming in through a different door.

The reason it matters for this module is that once you can write the same query two ways — DataFrame and SQL — you start to notice when each one is the right tool. And that decision feeds directly into how readable your pipelines are six months later, when somebody (probably you) is debugging them at 11 PM.

Two doors, one engine

Take any DataFrame. Register it as a temporary view. Query it with SQL.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("PySparkSQL")
         .master("local[*]")
         .getOrCreate())

orders = spark.createDataFrame(
    [
        (1, "IT", 59.0,  "2024-03-15"),
        (2, "IT", 29.0,  "2024-03-15"),
        (3, "NL", 149.0, "2024-04-02"),
        (4, "NL", 89.5,  "2024-04-02"),
        (5, "DE", 12.0,  "2025-01-08"),
        (6, "DE", 240.0, "2025-01-09"),
    ],
    "order_id INT, country STRING, total DOUBLE, dt STRING",
)

orders.createOrReplaceTempView("orders")

by_country_sql = spark.sql("""
    SELECT country, SUM(total) AS revenue
    FROM orders
    GROUP BY country
    ORDER BY revenue DESC
""")

by_country_df = (orders
                 .groupBy("country")
                 .agg(F.sum("total").alias("revenue"))
                 .orderBy(F.col("revenue").desc()))

Both by_country_sql and by_country_df are DataFrames. They go through the same Catalyst parser, the same logical plan rewrites, the same physical planner, the same Tungsten code generation. Run .explain() on each and you will see identical plans, because they are identical plans. The SQL string is just an alternative front end.

This is the most important fact of the lesson: there is no performance difference between the two APIs. None. Stop choosing one over the other “for performance.” Choose based on what reads better.

If you want to see the equivalence with your own eyes, run .explain(mode="extended") on both. The == Parsed Logical Plan == sections will differ slightly — the SQL string starts as an unresolved UnresolvedRelation, the DataFrame call starts already resolved — but by the time you reach == Optimized Logical Plan == and == Physical Plan ==, the two outputs are identical down to the operator IDs (which are session-counter cosmetics). That’s not a coincidence; it’s the architecture. The DataFrame builder methods construct a logical plan tree, and so does the SQL parser. After that, every optimization rule, every cost estimate, every code-generation step happens on a single shared representation.

When SQL is clearly better

There are a handful of cases where the SQL string will, every time, be more pleasant than the equivalent DataFrame chain.

Multi-CTE queries. Spark SQL supports the full CTE syntax — WITH a AS (...), b AS (...), c AS (...) SELECT ... — and if your transformation is naturally three or four named stages, that’s exactly how SQL describes it.

spark.sql("""
    WITH daily_revenue AS (
        SELECT dt, country, SUM(total) AS revenue
        FROM orders
        GROUP BY dt, country
    ),
    country_totals AS (
        SELECT country, SUM(revenue) AS country_revenue
        FROM daily_revenue
        GROUP BY country
    ),
    flagged AS (
        SELECT d.*, c.country_revenue,
               d.revenue / c.country_revenue AS share
        FROM daily_revenue d
        JOIN country_totals c USING (country)
    )
    SELECT * FROM flagged WHERE share > 0.1
""").show()

The DataFrame version of that is a chain of three intermediate variables, each followed by .alias("daily_revenue") and re-joined later. It works, but every reader has to mentally reconstruct what’s basically WITH-block scoping.

Window functions with multiple frames. SQL’s OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN ...) syntax is, frankly, what window functions were designed for. We’ll cover them properly next lesson, but a query that uses three different windows in the same SELECT is significantly tidier in SQL than in the DataFrame API where you have to declare each Window spec separately and use it in .over(...).

Code that mostly is SQL. When the transformation is “this query against the warehouse, then save,” and there’s no Python control flow in the middle, writing it as DataFrame chains is just translating SQL into Python and asking the next reader to translate it back. Skip that step.

When DataFrames are clearly better

The flip side: if you find yourself building a SQL string with f-strings and .join() calls, you’re working against the grain.

Programmatic generation. A list of columns to aggregate, a dict of renames, a config-driven pipeline — these are at home in Python, miserable in a SQL string.

metrics = ["revenue", "qty", "discount", "tax"]
agg_exprs = [F.sum(c).alias(f"total_{c}") for c in metrics]

(orders
 .groupBy("country")
 .agg(*agg_exprs)
 .show())

Trying to assemble that as "SELECT country, " + ", ".join(...) + " FROM ..." works but invites injection bugs and reads like a template engine.

Tight integration with control flow. If the next step depends on a check against a count, or you’re looping over a list of source paths, or you need to branch based on schema — you’re in Python. Stay in Python.

df = spark.read.parquet(path)

if "country" in df.columns:
    df = df.withColumn("country", F.upper("country"))

df.groupBy("country").count().show()

You can express conditional logic in SQL, but the seams show.

Type-safe column references. F.col("revenue") + F.col("tax") survives an IDE rename and lints. The string "revenue + tax" doesn’t. On a long-lived codebase, the column-reference style pays off.

Reusable column expressions. A complex predicate that gets used in three queries is one variable in DataFrame land:

is_active_paying = (F.col("status") == "active") & (F.col("plan") != "free")

active_orders   = orders.filter(is_active_paying)
active_users    = users.filter(is_active_paying)
active_sessions = sessions.filter(is_active_paying)

The SQL equivalent is the predicate copy-pasted into three WHERE clauses, with all the drift risk that implies. Catalyst won’t save you from typing the same thing three times slightly wrong.

Mixing the two

You don’t have to commit. The most ergonomic pipelines I’ve written take a DataFrame, register it, run a chunk of SQL because that part is genuinely cleaner, and hand the result back to DataFrame syntax for the bits Python does better:

raw = spark.read.parquet("s3://bucket/orders/")
raw.createOrReplaceTempView("raw_orders")

cleaned = spark.sql("""
    WITH dedup AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY ingested_at DESC) AS rn
        FROM raw_orders
    )
    SELECT order_id, country, total, dt
    FROM dedup
    WHERE rn = 1
""")

# Back to DataFrame for programmatic enrichment
for col in ["country", "dt"]:
    cleaned = cleaned.withColumn(col, F.trim(col))

cleaned.write.mode("overwrite").parquet("s3://bucket/orders_cleaned/")

Use whichever expresses the step the most clearly.

Inspecting what’s registered

spark.catalog is your handle on what views and tables Spark knows about right now.

spark.catalog.listTables()
# [Table(name='orders', database=None, description=None, tableType='TEMPORARY', ...)]

spark.catalog.listColumns("orders")
# [Column(name='order_id', description=None, dataType='int', nullable=True, ...), ...]

spark.catalog.dropTempView("orders")

Useful in notebooks when you’ve registered three views and forgotten which is which, and useful in scripts as a sanity check before a query runs.

Temp views, global temp views, and managed tables

Three levels of “permanent.” Pick whichever matches the lifetime you actually want.

Temp view. df.createOrReplaceTempView("orders"). Lives for the SparkSession. When the session ends, the view is gone. Scoped to the session that registered it — another SparkSession in the same application can’t see it. This is the default and the right answer 95% of the time.

Global temp view. df.createGlobalTempView("orders") (or createOrReplaceGlobalTempView). Lives for the entire Spark application — every SparkSession in the JVM can see it. Lives in the special global_temp database, so you query it as SELECT * FROM global_temp.orders. Useful when you have multiple sessions sharing a Spark application and need to pass a view between them. Almost nobody hits this case.

Managed table. df.write.saveAsTable("warehouse.orders"). Persisted to the metastore (Hive metastore, Glue, Unity Catalog) and the underlying files written to the warehouse directory. Survives restarts. Survives clusters. This is what you want for tables you’ll query tomorrow, next week, from another job, from another team. Module 6 used this for bucketing — same mechanism.

# Session-scoped — gone when the session ends
df.createOrReplaceTempView("session_local")

# App-scoped — survives across sessions in this app
df.createGlobalTempView("app_wide")
spark.sql("SELECT * FROM global_temp.app_wide").show()

# Persisted to the metastore — survives the cluster
df.write.mode("overwrite").saveAsTable("analytics.orders")

The mental model: temp view = local variable, global temp view = module-level variable, managed table = file on disk.

A small worked example end to end

Let’s do the multi-CTE pattern on the orders DataFrame from earlier and confirm both APIs produce the same plan.

orders.createOrReplaceTempView("orders")

# SQL form
sql_result = spark.sql("""
    WITH country_revenue AS (
        SELECT country, SUM(total) AS revenue
        FROM orders
        GROUP BY country
    )
    SELECT country, revenue
    FROM country_revenue
    WHERE revenue > 50
    ORDER BY revenue DESC
""")

# DataFrame form
df_result = (orders
             .groupBy("country")
             .agg(F.sum("total").alias("revenue"))
             .filter(F.col("revenue") > 50)
             .orderBy(F.col("revenue").desc()))

sql_result.explain(mode="formatted")
df_result.explain(mode="formatted")

The two explain outputs differ only in cosmetic ID numbers. Same physical plan, same shuffle count, same wall-clock time. The choice between them is purely about whose eyes will read this code in three months.

Run this on your own machine

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("PySparkSQLDemo")
         .master("local[*]")
         .getOrCreate())

orders = spark.createDataFrame(
    [(i, ["IT", "NL", "DE"][i % 3], float(i * 7 % 200), f"2024-{(i % 12) + 1:02d}-01")
     for i in range(1, 50)],
    "order_id INT, country STRING, total DOUBLE, dt STRING",
)

orders.createOrReplaceTempView("orders")

# Multi-CTE in SQL
spark.sql("""
    WITH monthly AS (
        SELECT country, SUBSTR(dt, 1, 7) AS month, SUM(total) AS revenue
        FROM orders
        GROUP BY country, SUBSTR(dt, 1, 7)
    ),
    ranked AS (
        SELECT *, RANK() OVER (PARTITION BY month ORDER BY revenue DESC) AS rk
        FROM monthly
    )
    SELECT * FROM ranked WHERE rk = 1 ORDER BY month
""").show()

# Same logic, partly DataFrame, partly SQL
df = (orders
      .withColumn("month", F.substring("dt", 1, 7))
      .groupBy("country", "month")
      .agg(F.sum("total").alias("revenue")))

df.createOrReplaceTempView("monthly")
spark.sql("""
    SELECT * FROM (
        SELECT *, RANK() OVER (PARTITION BY month ORDER BY revenue DESC) AS rk
        FROM monthly
    ) WHERE rk = 1
""").show()

# Inspect the catalog
for t in spark.catalog.listTables():
    print(t.name, t.tableType)

Run both. Skim both. Whichever one you find easier to read tomorrow morning is the right one for your team.

That RANK() OVER (PARTITION BY ... ORDER BY ...) you just wrote is a window function — the topic of the next lesson. Window functions are the second-most-useful tool in the SQL toolbox after GROUP BY, and once you have them, you stop reaching for self-joins to compute “row N versus row N-1” patterns.

References: Apache Spark SQL programming guide (https://spark.apache.org/docs/latest/sql-programming-guide.html) and the Spark Catalog API reference. Retrieved 2026-05-01.