Reading from JDBC: pulling from Postgres, MySQL, SQL Server

We’ve spent the last two lessons on file-based sources. But a huge fraction of real-world data lives in transactional databases — Postgres, MySQL, SQL Server, Oracle — and at some point you’ll need to pull from them into Spark for analytics, joins, ML training data, anything.

Spark’s JDBC connector is the bridge. It’s deceptively simple to call: one spark.read.format("jdbc") and you have a DataFrame. It’s deceptively easy to misuse: one wrong call and you melt your production database while your colleague’s report query times out and your DBA pages you at 11 PM.

This lesson is the safe-and-fast playbook for reading from JDBC sources. Next lesson handles writing.

The basic call

df = (spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://db.example.com:5432/orders")
        .option("dbtable", "public.orders")
        .option("user", "spark_reader")
        .option("password", "...")
        .option("driver", "org.postgresql.Driver")
        .load())

A few things to notice immediately:

The url is a JDBC URL, including the database name, not just a host. Format is driver-specific: jdbc:postgresql://..., jdbc:mysql://..., jdbc:sqlserver://..., jdbc:oracle:thin:@....
dbtable can be a table name or a subquery in parentheses with an alias: "(SELECT id, amount FROM orders WHERE created_at > '2026-01-01') t". The latter is how you push filtering to the database when Spark’s pushdown isn’t smart enough.
driver names the JDBC driver class. Each database needs its own — org.postgresql.Driver, com.mysql.cj.jdbc.Driver, com.microsoft.sqlserver.jdbc.SQLServerDriver. The driver JAR has to be on Spark’s classpath; we’ll cover that in a moment.
Credentials in code are bad. Use environment variables, secret managers, or .option("user", os.environ["DB_USER"]). Don’t commit a password.

That’s the simple version. Now the trap.

Why the simple version is dangerous

Run that code against a 200-million-row table. What happens?

Spark builds a single JDBC connection, sends one query — SELECT * FROM public.orders — and reads the entire result set through one cursor. The whole job runs on one executor on one task. Two hundred parallel CPUs sit idle. The source database, meanwhile, is generating two hundred million rows for a single connection that’s pulling them as fast as it can.

This is bad in three ways:

You get no parallelism on the Spark side. Your fancy 50-node cluster is useless.
You hold a single long-running query open against the source DB. Postgres takes a snapshot, MySQL holds locks (depending on isolation level), SQL Server may grow tempdb. Your colleague’s transactional query waits.
If the read fails halfway, you start over from zero. No partial progress.

I have personally watched a junior engineer take down a production Postgres replica by running this exact code against a 500 GB orders table on a Friday afternoon. The DBA was very polite about it.

Parallel reads: the four magic options

Spark gives you a way out. Four options that work together:

df = (spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://db.example.com:5432/orders")
        .option("dbtable", "public.orders")
        .option("user", "spark_reader")
        .option("password", os.environ["DB_PASSWORD"])
        .option("driver", "org.postgresql.Driver")
        .option("partitionColumn", "id")
        .option("lowerBound", 1)
        .option("upperBound", 200_000_000)
        .option("numPartitions", 32)
        .option("fetchsize", 10_000)
        .load())

What this does: Spark takes the range [lowerBound, upperBound] on partitionColumn, splits it into numPartitions equal chunks, and runs each chunk as its own SQL query in its own task on its own executor. With numPartitions=32, Spark issues 32 queries to the source DB, roughly:

SELECT * FROM public.orders WHERE id <  6_250_001
SELECT * FROM public.orders WHERE id >= 6_250_001 AND id < 12_500_001
SELECT * FROM public.orders WHERE id >= 12_500_001 AND id < 18_750_001
...
SELECT * FROM public.orders WHERE id >= 193_750_001

Now you have 32-way parallelism on the Spark side. The source DB sees 32 concurrent connections instead of 1, each scanning a smaller range. Total wall-clock time drops dramatically. If a single task fails, only that range is retried.

The constraints to make this work well:

partitionColumn should be numeric or a date/timestamp. Spark builds range filters on it; arbitrary text columns don’t work.
It should be roughly uniformly distributed across [lowerBound, upperBound]. Auto-increment primary keys are perfect. UUIDs are not (unless you have a numeric shadow column). Timestamps work if rows are spread over the range.
It should be indexed on the source DB, otherwise each of those parallel queries does a full table scan, and you’ve made things worse, not better.
lowerBound and upperBound don’t filter the data. Rows below lowerBound end up in the first partition; rows above upperBound end up in the last. The bounds only define how Spark splits the range. If you set them wildly wrong, you’ll get terrible skew — one partition with 99% of the rows.
numPartitions is the parallelism. Match it roughly to your executor cores, but cap it at what the source DB can tolerate concurrently. 32-64 is a reasonable range; 500 will get you blocked by the DBA.

A safe pattern is to fetch the actual min and max of partitionColumn first, in a tiny query, and use those:

bounds = (spark.read.format("jdbc")
          .option("url", url)
          .option("dbtable", "(SELECT MIN(id) AS lo, MAX(id) AS hi FROM public.orders) b")
          .option("user", user).option("password", pw)
          .option("driver", driver)
          .load()
          .first())

df = (spark.read.format("jdbc")
        .option("url", url)
        .option("dbtable", "public.orders")
        .option("user", user).option("password", pw)
        .option("driver", driver)
        .option("partitionColumn", "id")
        .option("lowerBound", bounds["lo"])
        .option("upperBound", bounds["hi"])
        .option("numPartitions", 32)
        .load())

The `predicates` alternative

The four-option pattern only works for monotonic numeric or date columns. What if your data is partitioned naturally by something else — a country code, a tenant ID, a status enum? Use the predicates option, which takes an explicit list of WHERE clauses, one per task:

predicates = [
    "country = 'IT'",
    "country = 'FR'",
    "country = 'DE'",
    "country = 'ES'",
    "country IN ('US', 'CA', 'MX')",
    "country NOT IN ('IT', 'FR', 'DE', 'ES', 'US', 'CA', 'MX')",
]

df = spark.read.jdbc(
    url=url,
    table="public.orders",
    predicates=predicates,
    properties={"user": user, "password": pw, "driver": driver},
)

Each predicate becomes one task running one SQL query. The predicates option is more flexible than partitionColumn because you control exactly what each task fetches, but it’s also more error-prone: the predicates must collectively cover every row exactly once. If they overlap, you get duplicates. If they have gaps, you get missing rows. There’s no validation — Spark trusts you.

A useful pattern: split by hash:

predicates = [f"MOD(id, 32) = {i}" for i in range(32)]

Even with non-numeric partitioning needs, this gives you 32-way parallelism with no overlap or gaps. The cost is each task does a function-call filter, which usually means a full scan unless you have a function-based index. Worth it for moderate-size tables, painful for huge ones.

The driver JAR

JDBC needs a driver JAR for each database type. Spark doesn’t ship them. You have to provide them.

Three ways to attach a driver:

# 1. --jars when you launch
spark-submit --jars /path/to/postgresql-42.7.0.jar my_job.py

# 2. --packages with Maven coordinates (downloads automatically)
spark-submit --packages org.postgresql:postgresql:42.7.0 my_job.py

# 3. In the SparkSession config (works for local development)
spark = (SparkSession.builder
         .config("spark.jars.packages", "org.postgresql:postgresql:42.7.0")
         .getOrCreate())

If you forget the JAR, you get java.lang.ClassNotFoundException: org.postgresql.Driver on first read. Easy diagnostic, easy fix.

Common Maven coordinates:

Postgres: org.postgresql:postgresql:42.7.0
MySQL: com.mysql:mysql-connector-j:8.3.0
SQL Server: com.microsoft.sqlserver:mssql-jdbc:12.4.2.jre11
Oracle: com.oracle.database.jdbc:ojdbc11:23.3.0.23.09 (license-permitting)

Match the version to your database major version when in doubt; mismatches usually work but occasionally produce strange type-coercion bugs.

Pushdown and its limits

Spark tries to push filters and projections down to the database. Simple cases work:

df = (spark.read.format("jdbc")
        .option("url", url).option("dbtable", "public.orders")
        ...
        .load())

(df.filter("country = 'IT'")
   .select("order_id", "amount")
   .explain())

# == Physical Plan ==
# *(1) Scan JDBCRelation(public.orders) [order_id#0,amount#3]
#   PushedFilters: [*EqualTo(country,IT)],
#   ReadSchema: struct<order_id:bigint,amount:double>

The PushedFilters and ReadSchema work the same way as Parquet (lesson 43): Spark sends a query like SELECT order_id, amount FROM public.orders WHERE country = 'IT' instead of pulling the whole table. The asterisk before EqualTo indicates the filter is fully pushed down — Spark won’t re-evaluate it.

What pushes down: equality, comparison, IN with a small set, IS NULL / IS NOT NULL, AND of pushable predicates. What doesn’t: function calls (UPPER(country) = 'IT'), arithmetic on the filtered column, OR with at least one non-pushable side, complex CASE expressions. When pushdown fails, Spark pulls more data than you expect and filters in-cluster. Always check .explain() before assuming a filter pushed down.

When pushdown isn’t enough, fall back to the subquery trick:

.option("dbtable", """
    (SELECT order_id, amount, country
     FROM public.orders
     WHERE country IN ('IT', 'FR', 'DE')
       AND created_at > '2026-01-01') t
""")

The database executes the subquery; Spark sees the result. You give up some readability but you control exactly what’s sent. Combine with partitionColumn for parallel reads on the subquery result — but note that partitionColumn must reference a column that’s in the subquery output.

`fetchsize` and other knobs

fetchsize controls how many rows the JDBC driver pulls per network round-trip. Default is driver-specific and usually small (Postgres: 0, meaning “all at once” — disastrous; MySQL: arrives in batches but you should still tune this). Set it explicitly:

.option("fetchsize", 10_000)

Bigger fetchsize means fewer round-trips, which is faster on high-latency connections. Cost: each task buffers fetchsize rows in driver memory before passing them on. 1,000-50,000 is typical. Tune up if reads are slow and the executor has memory headroom; tune down if you see OOM in the JDBC tasks.

For Postgres specifically, also set option("autocommit", "false") if you want fetchsize to actually take effect — the JDBC driver requires it for cursor-based fetching. Other databases have their own quirks; consult the driver docs when tuning matters.

A few more options worth knowing:

isolationLevel — the transaction isolation Spark uses. Default READ_UNCOMMITTED for most drivers, which is fine for analytics on a stable table. If you’re reading a table being heavily written, REPEATABLE_READ gives you a consistent snapshot at the cost of more locking.
queryTimeout — how long a single task’s query can run before being killed. Default unlimited. Set this in production so a runaway query doesn’t hang forever.
sessionInitStatement — SQL run on each connection before the main query. Useful for SET search_path, statement timeouts, or read-only mode.

The DON’T list

A few patterns that look reasonable and aren’t:

Don’t run analytical Spark jobs against your prod transactional DB. Even with parallel reads, you’re competing with the application traffic that DB exists to serve. The right architecture is replication: nightly snapshot, change-data-capture stream, or a managed service like Postgres logical replication into a warehouse, S3 export, or Debezium → Kafka → Avro lake. Then run Spark against the replica or the lake.

Don’t read a transactional table directly when you really want the latest snapshot for analytics. If your fact table changes every minute, an unbucketed range-partitioned read takes hours, and the data has changed by the time you finish. Snapshot the table once into Parquet, then iterate against the Parquet copy.

Don’t tune numPartitions upward without checking with the DB owner. Postgres can usually handle 32 concurrent reads. SQL Server tolerates more. Oracle is fussier. The DBA knows the connection limit, the connection pool size, and which times of day are busy. Ask.

Don’t use dbtable with a complex subquery and partitionColumn together without testing. The interaction is subtle: the partition filter wraps the subquery, and some databases handle that worse than others. Always run a single partition first to check the executed SQL is reasonable, then scale up.

Try this

A self-contained example using a local Postgres (Docker is easiest):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("JdbcDemo")
         .master("local[*]")
         .config("spark.jars.packages", "org.postgresql:postgresql:42.7.0")
         .getOrCreate())

url = "jdbc:postgresql://localhost:5432/demo"
props = {
    "user": "spark_reader",
    "password": "...",
    "driver": "org.postgresql.Driver",
    "fetchsize": "10000",
}

# Sequential read — one task, one query
slow = spark.read.jdbc(url=url, table="public.orders", properties=props)
print(slow.rdd.getNumPartitions())   # 1

# Parallel read — 16 tasks, 16 queries
fast = spark.read.jdbc(
    url=url,
    table="public.orders",
    column="id",
    lowerBound=1,
    upperBound=2_000_000,
    numPartitions=16,
    properties=props,
)
print(fast.rdd.getNumPartitions())   # 16

# Predicates pattern — useful when there's no monotonic column
preds = [f"MOD(id, 8) = {i}" for i in range(8)]
hashed = spark.read.jdbc(
    url=url, table="public.orders",
    predicates=preds, properties=props,
)
print(hashed.rdd.getNumPartitions())  # 8

Compare the wall-clock times of the three reads. The parallel one should be visibly faster on any non-trivial table. Look at pg_stat_activity on the Postgres side while a parallel read runs — you’ll see the 16 concurrent queries, each scanning its slice.

Next lesson, writing back to JDBC: the analogous parallelism story for inserts, why batch sizes matter even more on the write side, and the upsert/merge gymnastics required to update existing rows from a Spark DataFrame.

References: Apache Spark JDBC data source documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) and PostgreSQL JDBC driver guide (https://jdbc.postgresql.org/documentation/). Retrieved 2026-05-01.