PySpark · Programming · Narcis Miclaus

Course

PySpark, from the ground up

A 60-lesson course that starts at 'what is big data' and ends at 'here's the 30-minute health check I run on a Spark cluster I've never seen.' Theory where it matters, code where it counts, no hand-waving.

Published: 60 of 60

Lesson 1

Big data, in plain English

Published on 24 November 2025 · 9 min read

When data becomes 'big' in the technical sense, why one machine eventually isn't enough, and where Spark fits in the stack.
- #pyspark
- #spark
- #big-data
- #fundamentals
- #intro
Lesson 2

The MapReduce idea, and why it mattered

Published on 27 November 2025 · 9 min read

Google's 2004 paper, the model that made distributed processing tractable, and why everyone moved on from it within a decade.
- #pyspark
- #spark
- #mapreduce
- #hadoop
- #history
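
As a rough illustration of the model this lesson describes, here is a minimal word-count sketch in plain Python. It is an analogy only, not Hadoop or Spark code: the function names and the sample `documents` are hypothetical, and everything runs in one process where a real framework would spread the map and reduce phases across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Group values by key, the step the framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input records.
documents = ["spark replaced mapreduce", "mapreduce came first"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

The user writes only the map and reduce functions; the grouping in the middle is the framework's job, which is what made the model tractable to distribute.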
Lesson 3

What Spark is and why it replaced Hadoop MapReduce

Published on 1 December 2025 · 9 min read

Matei Zaharia's 2010 paper, in-memory execution, the DAG, lazy evaluation, and the 100x-faster claim — what it really means and what it doesn't.
- #pyspark
- #spark
- #fundamentals
- #history
Lesson 4

The Spark architecture: driver, executors, cluster manager

Published on 4 December 2025 · 8 min read

How a Spark job actually runs across machines. The driver, the executors, the cluster manager, and the resource model that ties them together.
- #pyspark
- #spark
- #architecture
- #fundamentals
Lesson 5

RDD, DataFrame, Dataset — three APIs, one engine

Published on 8 December 2025 · 8 min read

Why Spark has three APIs, what each is good at, when to use which, and why DataFrames won the day for almost everyone.
- #pyspark
- #spark
- #rdd
- #dataframe
- #api
Lesson 6

PySpark vs Scala Spark: what crosses the wire

Published on 11 December 2025 · 8 min read

How PySpark talks to the JVM, where the performance overhead lives, and when (rarely) you'd actually drop down to Scala.
- #pyspark
- #spark
- #scala
- #python
- #performance
Lesson 7

Installing PySpark locally

Published on 15 December 2025 · 9 min read

Pip-installing PySpark, the Java requirement that always trips people up, and the Windows-only Hadoop winutils gotcha.
- #pyspark
- #spark
- #install
- #setup
Lesson 8

Your first SparkSession

Published on 18 December 2025 · 8 min read

The entry point to any PySpark job. What a SparkSession is, the configurations that matter, and what `local[*]` actually means.
- #pyspark
- #spark
- #sparksession
- #configuration
Lesson 9

Reading data: CSV, JSON, Parquet, and the schema-on-read tradeoff

Published on 22 December 2025 · 9 min read

Three file formats, three default behaviors, and why getting reads right early saves you a hundred problems later.
- #pyspark
- #spark
- #csv
- #json
- #parquet
- #io
Lesson 10

Show, count, collect — the actions every beginner runs first

Published on 25 December 2025 · 9 min read

The three actions every PySpark notebook starts with, the difference between them, and why mixing them up at scale is dangerous.
- #pyspark
- #spark
- #actions
- #dataframe
Lesson 11

Writing data: modes, partitions, and the file-count problem

Published on 29 December 2025 · 8 min read

Save modes, partitioned writes, the difference between many small files and one giant file, and why Parquet is the default.
- #pyspark
- #spark
- #write
- #io
- #parquet
Lesson 12

Local vs cluster: dev workflow that doesn't lie to you

Published on 1 January 2026 · 10 min read

When local mode is enough, when you need a real cluster, and the bugs that only show up when there are actual executors in the picture.
- #pyspark
- #spark
- #cluster
- #deployment
- #workflow
Lesson 13

Schemas: explicit vs inferred

Published on 5 January 2026 · 8 min read

When to let Spark infer, when to declare your own, and why production code basically always declares.
- #pyspark
- #spark
- #schema
- #dataframe
Lesson 14

Select and filter: the two operations you'll do thousands of times

Published on 8 January 2026 · 8 min read

select, where, filter, and the four ways to refer to a column — including the one that breaks when you have spaces in column names.
- #pyspark
- #spark
- #dataframe
- #select
- #filter
Lesson 15

Adding columns: withColumn, lit, and the chaining trap

Published on 12 January 2026 · 8 min read

How to add or modify columns, why withColumn calls in a loop are a known performance pitfall, and when to use select instead.
- #pyspark
- #spark
- #dataframe
- #withColumn
- #transformation
Lesson 16

Aggregations 101: groupBy, agg, and the catalog of summary functions

Published on 15 January 2026 · 8 min read

groupBy + agg, the basic aggregate functions, multi-column aggregations in one pass, and why agg is a wide transformation.
- #pyspark
- #spark
- #dataframe
- #groupby
- #aggregation
Lesson 17

Sorting at scale: orderBy, sort, and the global-sort cost

Published on 19 January 2026 · 8 min read

How sorting works in a distributed engine, why a global sort is expensive, and the sortWithinPartitions escape hatch.
- #pyspark
- #spark
- #dataframe
- #sort
- #orderby
Lesson 18

Renaming, dropping, casting: the everyday cleanup operators

Published on 22 January 2026 · 8 min read

withColumnRenamed, drop, cast, and the small-but-frequent operations that make up half of any real ETL.
- #pyspark
- #spark
- #dataframe
- #schema
- #etl
Lesson 19

Lazy evaluation: why nothing happens until you ask

Published on 26 January 2026 · 8 min read

Why your transformation chain doesn't actually compute when you call it, what 'lazy' really means in Spark, and the Pandas mental-model adjustment every newcomer has to make.
- #pyspark
- #spark
- #lazy-evaluation
- #fundamentals
- #dag
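
For readers coming from Pandas, a loose analogy in plain Python may help: generators also defer work until something consumes them. The sketch below is not PySpark, and the names (`read_rows`, `log`) are hypothetical. Spark goes further than this, since it records the whole chain as a plan and optimizes it before running anything.

```python
log = []  # records when work actually happens


def read_rows():
    """Hypothetical data source that notes every row it produces."""
    for n in range(5):
        log.append(f"read {n}")
        yield n


# "Transformations": building the chain runs nothing.
doubled = (n * 2 for n in read_rows())
large = (n for n in doubled if n >= 4)
work_before_action = len(log)

# "Action": consuming the chain triggers every step at once.
result = list(large)
work_after_action = len(log)

print(work_before_action, work_after_action, result)
```

No rows are read while the chain is being built; all of the reading happens only when `list()` asks for a result.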
Lesson 20

Transformations vs actions: the dichotomy and the catalog

Published on 29 January 2026 · 9 min read

Every PySpark operation is either a transformation or an action. Knowing which is which is half of debugging.
- #pyspark
- #spark
- #dataframe
- #actions
- #transformations
Lesson 21

Narrow vs wide transformations: the most important Spark concept

Published on 2 February 2026 · 9 min read

Why some transformations are nearly free and others require shuffling data across the entire cluster. The single distinction that explains every Spark performance question.
- #pyspark
- #spark
- #transformations
- #shuffle
- #performance
Lesson 22

The DAG: how Spark organizes your job into stages

Published on 5 February 2026 · 9 min read

Visualizing your job as a directed acyclic graph, reading the Spark UI's stages tab, and the relationship between stages and shuffles.
- #pyspark
- #spark
- #dag
- #execution-plan
- #spark-ui
Lesson 23

Caching and persistence: storage levels, when each makes sense

Published on 9 February 2026 · 8 min read

df.cache() and df.persist() — what they actually do, the storage levels Spark offers, and the typical patterns where caching pays off.
- #pyspark
- #spark
- #cache
- #persist
- #performance
Lesson 24

.cache() is not free — when to use it, when it's a trap

Published on 12 February 2026 · 8 min read

Spark's cache and persist sound like magic performance buttons. They're not. Here's when caching actually helps, when it makes things worse, and how to tell the difference.
- #pyspark
- #spark
- #caching
- #performance
Lesson 25

What a shuffle actually is, in physical terms

Published on 16 February 2026 · 8 min read

The network operation at the heart of distributed computing, what happens during one, and why everyone fears it.
- #pyspark
- #spark
- #shuffle
- #performance
- #network
Lesson 26

Joins in PySpark: the seven types and when to use each

Published on 19 February 2026 · 8 min read

Inner, left, right, full outer, semi, anti, cross — what each one does, the syntax, and the everyday use cases.
- #pyspark
- #spark
- #joins
- #dataframe
Lesson 27

Broadcast joins: when small tables ride along on every executor

Published on 23 February 2026 · 8 min read

How broadcast joins skip the shuffle, when Spark picks one automatically, and how to force or disable the behavior.
- #pyspark
- #spark
- #joins
- #broadcast
- #performance
Lesson 28

The skew problem: when one key has 100x the rows

Published on 26 February 2026 · 9 min read

How data skew slows down jobs even when the total work is small, how to spot it in the Spark UI, and what the symptoms look like in production.
- #pyspark
- #spark
- #skew
- #performance
- #debugging
Lesson 29

Salting: the standard fix when one key dominates

Published on 2 March 2026 · 8 min read

How to break up a hot key by adding a synthetic random suffix, the worked example, and the cost of the trick.
- #pyspark
- #spark
- #skew
- #salting
- #performance
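
The idea of a synthetic random suffix can be sketched in plain Python as a two-stage aggregation. This is an illustration under assumptions, not the lesson's worked example and not PySpark: `NUM_SALTS`, the seeded `rng`, and the skewed `events` list are all hypothetical, and the sketch covers a salted aggregation rather than a salted join.

```python
import random
from collections import Counter

NUM_SALTS = 4           # hypothetical number of salt values
rng = random.Random(0)  # seeded so the run is repeatable

# Hypothetical skewed input: one key carries almost every row.
events = [("hot", 1)] * 1000 + [("cold", 1)] * 10

# Stage 1: append a random salt so the hot key is split across several groups.
partial = Counter()
for key, value in events:
    salt = rng.randrange(NUM_SALTS)
    partial[(key, salt)] += value

# Stage 2: drop the salt and combine the partial results per original key.
totals = Counter()
for (key, _salt), subtotal in partial.items():
    totals[key] += subtotal

print(dict(totals))
```

The cost is visible in the structure: one aggregation became two, in exchange for no single group having to process every row of the hot key.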
Lesson 30

PySpark joins that don't blow up the cluster

Published on 5 March 2026 · 9 min read

Why joins are the number-one source of Spark pain, what the shuffle actually does, and the broadcast and salting tricks that turn a 40-minute job into a 4-minute one.
- #pyspark
- #spark
- #performance
- #joins
Lesson 31

What a partition is, physically

Published on 9 March 2026 · 8 min read

Partitions in memory, partitions on disk, and the relationship between partitions and tasks.
- #pyspark
- #spark
- #partitions
- #fundamentals
Lesson 32

spark.sql.shuffle.partitions = 200 and why it's almost always wrong

Published on 12 March 2026 · 8 min read

The single most consequential default in Spark, why it doesn't fit your cluster, and how to tune it for the job at hand.
- #pyspark
- #spark
- #partitions
- #configuration
- #performance
Lesson 33

repartition vs coalesce: two ways to change partition count

Published on 16 March 2026 · 8 min read

When to use which, the cost of each, and the gotcha of accidentally serializing your job to one task.
- #pyspark
- #spark
- #partitions
- #repartition
- #coalesce
Lesson 34

Partitioned writes: directory layout, predicate pushdown, and when to do it

Published on 19 March 2026 · 7 min read

Hive-style partition columns on disk, how Spark uses them at read time to skip files, and the cardinality trap to avoid.
- #pyspark
- #spark
- #partitions
- #parquet
- #predicate-pushdown
Lesson 35

Partitioning: the thing that quietly kills your Spark job

Published on 23 March 2026 · 9 min read

How data gets split across executors, why the default is almost always wrong, and the repartition/coalesce dance that every Spark job eventually needs.
- #pyspark
- #spark
- #partitioning
- #performance
Lesson 36

Bucketing: when partitioning isn't enough

Published on 26 March 2026 · 8 min read

Hash-partitioning into a fixed number of buckets at write time, the bucket join optimization, and why bucketing is underused.
- #pyspark
- #spark
- #bucketing
- #performance
- #joins
Lesson 37

PySpark SQL: when SQL beats DataFrame syntax

Published on 30 March 2026 · 8 min read

Registering temp views, calling spark.sql(), and the cases where the SQL string is genuinely cleaner than the DataFrame chain.
- #pyspark
- #spark
- #sql
- #dataframe
- #temp-view
Lesson 38

Window functions: ranking, lag/lead, running totals

Published on 2 April 2026 · 8 min read

Window.partitionBy().orderBy(), the family of window functions, and why they're the second-most-useful tool after groupBy.
- #pyspark
- #spark
- #window-functions
- #dataframe
Lesson 39

Pivot and unpivot: wide-to-long and back

Published on 6 April 2026 · 8 min read

Reshaping data with pivot(), the trick for unpivoting before Spark 3.4, and the cost of wide tables.
- #pyspark
- #spark
- #pivot
- #unpivot
- #reshape
Lesson 40

UDFs: when you need them, why you should avoid them

Published on 9 April 2026 · 8 min read

The Python serialization tax of regular UDFs, why pandas_udf saves you, and the rare cases where Scala is the only answer.
- #pyspark
- #spark
- #udf
- #pandas-udf
- #performance
Lesson 41

Catalyst: the brain behind every DataFrame

Published on 13 April 2026 · 8 min read

How Spark turns your code into a query plan, the four phases of optimization, and how to read .explain(True).
- #pyspark
- #spark
- #catalyst
- #optimizer
- #explain
Lesson 42

Tungsten: code generation and the columnar memory layout

Published on 16 April 2026 · 8 min read

How Spark fuses operations into compiled code, the off-heap columnar format, and why DataFrame Spark is fast.
- #pyspark
- #spark
- #tungsten
- #performance
- #internals
Lesson 43

Parquet: why it's the default for a reason

Published on 20 April 2026 · 9 min read

Columnar storage explained, compression codecs, predicate pushdown, and the row-group structure that makes selective reads fast.
- #pyspark
- #spark
- #parquet
- #file-format
- #columnar
Lesson 44

ORC, Avro, Delta: the alternatives and when each wins

Published on 23 April 2026 · 9 min read

Three format families that aren't Parquet, when each is the right choice, and why Delta has been quietly taking over.
- #pyspark
- #spark
- #orc
- #avro
- #delta
- #file-format
Lesson 45

Reading from JDBC: pulling from Postgres, MySQL, SQL Server

Published on 27 April 2026 · 9 min read

The JDBC source connector, the partitionColumn trick, and why a naive read kills your source database.
- #pyspark
- #spark
- #jdbc
- #postgres
- #mysql
- #parallel-read
Lesson 46

Writing to JDBC: parallelism, batches, idempotency

Published on 30 April 2026 · 9 min read

How to write Spark output back to a relational database without crushing it, breaking transactions, or losing data on retry.
- #pyspark
- #spark
- #jdbc
- #write
- #transactions
Lesson 47

Cloud storage: S3, GCS, Azure Blob — what changes

Published on 4 May 2026 · 9 min read

The consistency caveats, the rename problem, and why direct-write committers exist.
- #pyspark
- #spark
- #s3
- #cloud
- #storage
- #hadoop
Lesson 48

Schema evolution: when columns change underneath you

Published on 7 May 2026 · 9 min read

Why schema-on-read formats handle change badly, why Avro+registry handles it well, and the Delta/Iceberg way of getting both.
- #pyspark
- #spark
- #schema
- #parquet
- #avro
- #evolution
Lesson 49

Why streaming, and what 'streaming' even means in Spark

Published on 11 May 2026 · 9 min read

Bounded vs unbounded data, batch-vs-streaming as a continuum, and why DStreams are deprecated in favor of Structured Streaming.
- #pyspark
- #spark
- #streaming
- #structured-streaming
- #fundamentals
Lesson 50

Structured Streaming basics: readStream, writeStream, triggers

Published on 14 May 2026 · 10 min read

The streaming entry points, the trigger semantics, and the checkpoint that everything depends on.
- #pyspark
- #spark
- #streaming
- #structured-streaming
- #dataframe
Lesson 51

Kafka source: the most common production ingest

Published on 18 May 2026 · 10 min read

How Spark reads from Kafka, the offset semantics, and the at-least-once vs exactly-once question.
- #pyspark
- #spark
- #kafka
- #streaming
- #structured-streaming
Lesson 52

Watermarks and event time: the part most beginners get wrong

Published on 21 May 2026 · 7 min read

Why event time matters more than processing time, what a watermark actually does, and the worked example with concrete timestamps.
- #pyspark
- #spark
- #streaming
- #watermarks
- #event-time
Lesson 53

Stateful operations: aggregations, sessions, and the state store

Published on 25 May 2026 · 7 min read

Where Structured Streaming keeps state between micro-batches, the standard stateful patterns, and when to drop down to mapGroupsWithState.
- #pyspark
- #spark
- #streaming
- #state
- #sessionization
Lesson 54

Output modes and idempotent sinks: foreachBatch and the upsert pattern

Published on 28 May 2026 · 8 min read

Append vs update vs complete, the sinks Spark ships, and the foreachBatch escape hatch for everything else.
- #pyspark
- #spark
- #streaming
- #sinks
- #idempotent
- #foreach-batch
Lesson 55

The Spark UI: the most important tool you'll learn

Published on 1 June 2026 · 9 min read

A guided tour of every tab — Jobs, Stages, Tasks, SQL, Storage, Executors — and what each one tells you when something is wrong.
- #pyspark
- #spark
- #ui
- #debugging
- #production
Lesson 56

Reading execution plans: .explain(True), parsed to physical

Published on 4 June 2026 · 9 min read

How to read every line of .explain() output, the operators that matter, and the optimizer steps that produce them.
- #pyspark
- #spark
- #explain
- #execution-plan
- #catalyst
Lesson 57

Memory tuning: executor memory, overhead, OOM diagnostics

Published on 8 June 2026 · 10 min read

The four configs that actually matter, what spill means, how to read an OOM stack trace, and the rule for sizing executors.
- #pyspark
- #spark
- #memory
- #tuning
- #production
Lesson 58

Debugging slow Spark jobs: the 30-minute checklist

Published on 11 June 2026 · 8 min read

The systematic loop for figuring out what's wrong with a slow job — read the UI, find the slow stage, look at task skew, GC, shuffle volume, in that order.
- #pyspark
- #spark
- #debugging
- #performance
- #production
Lesson 59

Adaptive Query Execution: Spark 3.x's killer feature

Published on 15 June 2026 · 8 min read

Dynamic partition coalescing, runtime skew handling, and join strategy switching — the configs to know and the cases AQE still can't help.
- #pyspark
- #spark
- #aqe
- #optimization
- #performance
Lesson 60

A 30-minute health check on a Spark cluster you've never seen

Published on 18 June 2026 · 12 min read

The capstone checklist: you're handed a laptop and have until 5pm to figure out what's broken.
- #pyspark
- #spark
- #dba
- #health-check
- #course-summary