PySpark
Distributed dataframes, joins that don't blow up the cluster, and the parts of Spark that bite.
-
Lesson 1
Big data, in plain English
When does data become 'big' in the technical sense, why one machine eventually isn't enough, and where Spark fits in the stack.
-
Lesson 2
The MapReduce idea, and why it mattered
Google's 2004 paper, the model that made distributed processing tractable, and why everyone moved on from it within a decade.
-
Lesson 3
What Spark is and why it replaced Hadoop MapReduce
Matei Zaharia's 2010 paper, in-memory execution, the DAG, lazy evaluation, and the 100x-faster claim — what it really means and what it doesn't.
-
Lesson 4
The Spark architecture: driver, executors, cluster manager
How a Spark job actually runs across machines. The driver, the executors, the cluster manager, and the resource model that ties them together.
-
Lesson 5
RDD, DataFrame, Dataset — three APIs, one engine
Why Spark has three APIs, what each is good at, when to use which, and why DataFrames won the day for almost everyone.
-
Lesson 6
PySpark vs Scala Spark: what crosses the wire
How PySpark talks to the JVM, where the performance overhead lives, and when (rarely) you'd actually drop down to Scala.
-
Lesson 7
Installing PySpark locally
Pip-installing PySpark, the Java requirement that always trips people up, and the Windows-only Hadoop winutils gotcha.
-
Lesson 8
Your first SparkSession
The entry point to any PySpark job. What a SparkSession is, the configurations that matter, and what `local[*]` actually means.
-
Lesson 9
Reading data: CSV, JSON, Parquet, and the schema-on-read tradeoff
Three file formats, three default behaviors, and why getting reads right early saves you a hundred problems later.
-
Lesson 10
Show, count, collect — the actions every beginner runs first
The three actions every PySpark notebook starts with, the difference between them, and why mixing them up at scale is dangerous.
-
Lesson 11
Writing data: modes, partitions, and the file-count problem
Save modes, partitioned writes, the difference between many small files and one giant file, and why Parquet is the default for a reason.
-
Lesson 12
Local vs cluster: dev workflow that doesn't lie to you
When local mode is enough, when you need a real cluster, and the bugs that only show up when there are actual executors in the picture.
-
Lesson 13
Schemas: explicit vs inferred
When to let Spark infer, when to declare your own, and why production code basically always declares.
-
Lesson 14
Select and filter: the two operations you'll do thousands of times
select, where, filter, and the four ways to refer to a column — including the one that breaks when you have spaces in column names.
-
Lesson 15
Adding columns: withColumn, lit, and the chaining trap
How to add or modify columns, why withColumn calls in a loop are a known performance pitfall, and when to use select instead.
-
Lesson 16
Aggregations 101: groupBy, agg, and the catalog of summary functions
groupBy + agg, the basic aggregate functions, multi-column aggregations in one pass, and why agg is a wide transformation.
-
Lesson 17
Sorting at scale: orderBy, sort, and the global-sort cost
How sorting works in a distributed engine, why a global sort is expensive, and the sortWithinPartitions escape hatch.
-
Lesson 18
Renaming, dropping, casting: the everyday cleanup operators
withColumnRenamed, drop, cast, and the small-but-frequent operations that make up half of any real ETL.
-
Lesson 19
Lazy evaluation: why nothing happens until you ask
Why your transformation chain doesn't actually compute when you call it, what 'lazy' really means in Spark, and the Pandas mental-model adjustment every newcomer has to make.
-
Lesson 20
Transformations vs actions: the dichotomy and the catalog
Every PySpark operation is either a transformation or an action. Knowing which is which is half of debugging.
-
Lesson 21
Narrow vs wide transformations: the most important Spark concept
Why some transformations are nearly free and others require shuffling the entire cluster. The single distinction that explains every Spark performance question.
-
Lesson 22
The DAG: how Spark organizes your job into stages
Visualizing your job as a directed acyclic graph, reading the Spark UI's stages tab, and the relationship between stages and shuffles.
-
Lesson 23
Caching and persistence: storage levels, when each makes sense
df.cache() and df.persist() — what they actually do, the storage levels Spark offers, and the typical patterns where caching pays off.
-
Lesson 24
.cache() is not free — when to use it, when it's a trap
Spark's cache and persist sound like magic performance buttons. They're not. Here's when caching actually helps, when it makes things worse, and how to tell the difference.
-
Lesson 25
What a shuffle actually is, in physical terms
The network operation at the heart of distributed computing, what happens during one, and why everyone fears it.
-
Lesson 26
Joins in PySpark: the seven types and when to use each
Inner, left, right, full outer, semi, anti, cross — what each one does, the syntax, and the everyday use cases.
-
Lesson 27
Broadcast joins: when small tables ride along on every executor
How broadcast joins skip the shuffle, when Spark picks one automatically, and how to force or disable the behavior.
-
Lesson 28
The skew problem: when one key has 100x the rows
How data skew slows down jobs even when the total work is small, how to spot it in the Spark UI, and what symptoms look like in production.
-
Lesson 29
Salting: the standard fix when one key dominates
How to break up a hot key by adding a synthetic random suffix, the worked example, and the cost of the trick.
-
Lesson 30
PySpark joins that don't blow up the cluster
Why joins are the number-one source of Spark pain, what the shuffle actually does, and the broadcast and salting tricks that turn a 40-minute job into a 4-minute one.
-
Lesson 31
What a partition is, physically
Partitions in memory, partitions on disk, and the relationship between partitions and tasks.
-
Lesson 32
spark.sql.shuffle.partitions = 200 and why it's almost always wrong
The single most consequential default in Spark, why it doesn't fit your cluster, and how to tune it for the job at hand.
-
Lesson 33
repartition vs coalesce: two ways to change partition count
When to use which, the cost of each, and the gotcha of accidentally serializing your job to one task.
-
Lesson 34
Partitioned writes: directory layout, predicate pushdown, and when to do it
Hive-style partition columns on disk, how Spark uses them at read time to skip files, and the cardinality trap to avoid.
-
Lesson 35
Partitioning: the thing that quietly kills your Spark job
How data gets split across executors, why the default is almost always wrong, and the repartition/coalesce dance that every Spark job eventually needs.
-
Lesson 36
Bucketing: when partitioning isn't enough
Hash-partitioning into a fixed number of buckets at write time, the bucket join optimization, and why bucketing is underused.
-
Lesson 37
PySpark SQL: when SQL beats DataFrame syntax
Registering temp views, calling spark.sql(), and the cases where the SQL string is genuinely cleaner than the DataFrame chain.
-
Lesson 38
Window functions: ranking, lag/lead, running totals
Window.partitionBy().orderBy(), the family of window functions, and why they're the second-most-useful tool after groupBy.
-
Lesson 39
Pivot and unpivot: wide-to-long and back
Reshaping data with pivot(), the trick for unpivoting before Spark 3.4, and the cost of wide tables.
-
Lesson 40
UDFs: when you need them, why you should avoid them
The Python serialization tax of regular UDFs, why pandas_udf saves you, and the rare cases where Scala is the only answer.
-
Lesson 41
Catalyst: the brain behind every DataFrame
How Spark turns your code into a query plan, the four phases of optimization, and how to read .explain(True).
-
Lesson 42
Tungsten: code generation and the columnar memory layout
How Spark fuses operations into compiled code, the off-heap columnar format, and why DataFrame Spark is fast.
-
Lesson 43
Parquet: why it's the default for a reason
Columnar storage explained, compression codecs, predicate pushdown, and the row-group structure that makes selective reads fast.
-
Lesson 44
ORC, Avro, Delta: the alternatives and when each wins
Three format families that aren't Parquet, when each is the right choice, and why Delta has been quietly taking over.
-
Lesson 45
Reading from JDBC: pulling from Postgres, MySQL, SQL Server
The JDBC source connector, the partitionColumn trick, and why a naive read kills your source database.
-
Lesson 46
Writing to JDBC: parallelism, batches, idempotency
How to write Spark output back to a relational database without crushing it, breaking transactions, or losing data on retry.
-
Lesson 47
Cloud storage: S3, GCS, Azure Blob — what changes
The consistency caveats, the rename problem, and why direct-write committers exist.
-
Lesson 48
Schema evolution: when columns change underneath you
Why schema-on-read formats handle change badly, why Avro+registry handles it well, and the Delta/Iceberg way of getting both.
-
Lesson 49
Why streaming, and what 'streaming' even means in Spark
Bounded vs unbounded data, batch-vs-streaming as a continuum, and why DStreams are deprecated in favor of Structured Streaming.
-
Lesson 50
Structured Streaming basics: readStream, writeStream, triggers
The streaming entry points, the trigger semantics, and the checkpoint that everything depends on.
-
Lesson 51
Kafka source: the most common production ingest
How Spark reads from Kafka, the offset semantics, and the at-least-once vs exactly-once question.
-
Lesson 52
Watermarks and event time: the part most beginners get wrong
Why event time matters more than processing time, what a watermark actually does, and the worked example with concrete timestamps.
-
Lesson 53
Stateful operations: aggregations, sessions, and the state store
Where Structured Streaming keeps state between micro-batches, the standard stateful patterns, and when to drop down to mapGroupsWithState.
-
Lesson 54
Output modes and idempotent sinks: foreachBatch and the upsert pattern
Append vs update vs complete, the sinks Spark ships, and the foreachBatch escape hatch for everything else.
-
Lesson 55
The Spark UI: the most important tool you'll learn
A guided tour of the Jobs, Stages, SQL, Storage, and Executors tabs, plus the task-level view inside each stage, and what each one tells you when something is wrong.
-
Lesson 56
Reading execution plans: .explain(True), parsed to physical
How to read every line of .explain() output, the operators that matter, and the optimizer steps that produce them.
-
Lesson 57
Memory tuning: executor memory, overhead, OOM diagnostics
The four configs that actually matter, what spill means, how to read an OOM stack trace, and the rule for sizing executors.
-
Lesson 58
Debugging slow Spark jobs: the 30-minute checklist
The systematic loop for figuring out what's wrong with a slow job — read the UI, find the slow stage, look at task skew, GC, shuffle volume, in that order.
-
Lesson 59
Adaptive Query Execution: Spark 3.x's killer feature
Dynamic partition coalescing, runtime skew handling, and join strategy switching — the configs to know and the cases AQE still can't help.
-
Lesson 60
A 30-minute health check on a Spark cluster you've never seen
The capstone checklist: someone hands you their laptop, and you have until 5pm to figure out what's broken.
