Big data, in plain English · Narcis Miclaus

Welcome to lesson one of the PySpark course. If you’ve been doing data work for any length of time, you’ve watched the same thing happen to half a dozen colleagues: they open a CSV in Pandas, the kernel chews through 16 GB of RAM, the laptop fan starts sounding like a small aircraft, the kernel dies, and they post in Slack asking whether it’s time to “use Spark.” That moment — the one where Pandas runs out of patience and a bigger hammer is needed — is what this course is about. Sixty lessons, starting with theory and ending with you tuning real production jobs against object storage. By the end you’ll know what Spark is, when to reach for it, when not to reach for it, and how to write PySpark that doesn’t embarrass you in code review.

Before we touch any code, though, we need to talk about the phrase that put Spark on the map in the first place: big data. The term has been so thoroughly mangled by marketing departments since about 2012 that most engineers I work with have a slight flinch reflex when they hear it. That’s fair. Most things sold as “big data” turn out to be a 4 GB CSV that someone forgot to put a dtype= on. But the underlying idea is real, the engineering problems are real, and the reason Spark exists is real. So let’s actually define our terms.

What “big” actually means

Big data is data that doesn’t fit, or doesn’t process, on one computer in a reasonable amount of time. That’s it. There isn’t a magic number of gigabytes or rows above which data becomes Big with a capital B. It’s a relationship between three things: how much data you have, what you’re trying to do with it, and what hardware you have to throw at it.

A 200 GB Parquet file is not big data on a 96-core EC2 instance with 768 GB of RAM and a fast NVMe drive. You can pull the whole thing into memory and call it a day. The same 200 GB file is big data on a MacBook with 16 GB of RAM, where you can’t even open it without paging to disk and grinding to a halt. Same data, different “big.”

The classic definition, the one your manager probably learned from a McKinsey deck around 2014, is the four V’s: Volume, Velocity, Variety, and Veracity. You should know what they are because someone will quote them at you in an interview, but you should also know that two of them are real and two of them are mostly there to fill a slide.

Volume is how much. This one matters. The amount of data you’re processing genuinely changes what tools you can use. Five gigabytes is a Pandas problem. Five terabytes is a Spark problem. Five petabytes is a “we need to talk about your data warehouse” problem. Volume is the V that wakes you up at 3 AM when a job runs out of disk.

Velocity is how fast it arrives. This one also matters, especially in 2026 where streaming has become the default rather than the exception. A job that processes a static daily dump is one shape of problem; a job that has to keep up with 100,000 events per second from a Kafka topic is a fundamentally different shape. Spark has Structured Streaming, we’ll cover it in module 9, and it exists precisely because batch alone stopped being enough about a decade ago.

Variety is how different the formats are — JSON here, Parquet there, a SQL extract over there, an XML feed because someone’s payment provider is stuck in 2009. This is real, but it’s not really a technical hardship the way Volume and Velocity are. It’s mostly a “you need to write more parsers” problem, and any reasonable distributed engine will let you bolt on a connector or write a UDF. Variety is on the list because it makes for a nicely alliterative slide.

Veracity is how trustworthy the data is. This is, with respect, not a property of the data; it’s a property of the data pipeline. If your CRM system is producing junk, no amount of Spark, Snowflake, or sacred geometry will fix that. Data quality is a real problem, but it’s a process and tooling problem, not a “do I need a distributed engine” problem. The fact that this V exists at all is mostly because someone needed a fourth V to round out the rule-of-three into a rule-of-four.

So the honest version: Volume and Velocity are why you reach for distributed compute. Variety and Veracity are why you have a job after you’ve reached for it.

The thresholds, roughly

Here are the rules of thumb I use, and that most data engineers I trust will quote you back if you ask them at lunch. None of these are laws of physics; they shift with hardware, with cloud pricing, with what your data looks like, and with how patient you are. But they’re a useful first cut.

Up to ~10 GB: Pandas is fine. Polars is faster and uses less memory and you should probably be using it instead, but Pandas works. DuckDB is fine. SQLite is fine. A single-node Postgres is fine. None of this is a Spark problem. If someone reaches for a distributed engine to process a 4 GB CSV, they are about to spend three days writing infrastructure for something a df = pd.read_csv(...) would have handled in ten minutes.

~10 GB to ~100 GB: the awkward middle. A modern laptop with 32 or 64 GB of RAM can technically handle this in Pandas if you’re careful with dtypes and don’t mind your machine being unusable while it runs. Polars handles it gracefully. DuckDB handles it brilliantly — it’s purpose-built for exactly this range. A beefy cloud VM handles it without breaking a sweat. You can use Spark here and it will work, but you’re paying the distributed-systems tax (cluster startup time, shuffle overhead, more complicated debugging) for not much benefit. This is the range where the right answer is increasingly “DuckDB on a fat single node,” and where Spark’s marketing has historically oversold itself.

~100 GB to ~1 TB: here Spark starts earning its keep. You can still cram this onto one beefy machine if you have a budget for it, and tools like DuckDB will keep up to a point, but parallelism across multiple machines starts being genuinely faster, especially for the wide-table joins and aggregations that come up in real warehouses. Spark is a perfectly reasonable default in this range. So is Snowflake. So is BigQuery. Pick based on what your team already operates.

~1 TB and up: one machine is hopeless, or at least catastrophically expensive. A distributed engine is the only sane answer. Spark, Trino, Snowflake, BigQuery, Athena, Synapse — different products, same fundamental shape: split the data across many machines, process in parallel, combine the results. This is the home turf of this course.

~100 TB and up: you are now in territory where the interesting problems are not “can I run this query” but “can I run this query without bankrupting the company.” Cost optimization, data layout, partition pruning, predicate pushdown, broadcast joins — all the second-half-of-the-course material — becomes the actual job. Anyone can run a SELECT * against a petabyte. Running it for €4 instead of €4,000 is the skill.

Why parallelism is hard

The naive answer to “my data doesn’t fit on one machine” is “use more machines.” The technical answer is that this is approximately correct but enormously more complicated than it sounds. The reason distributed compute engines exist as their own field of engineering is that all of the following have to work, and have to work under failure, for the abstraction to be useful.

Splitting the data. You have to decide, for any given job, how to chop the input up into pieces small enough to fit on individual workers. That sounds trivial until you remember that the data lives in S3, or HDFS, or Kafka, or all three at once, and the splits have to align with how the storage system itself is laid out. Spark spends a meaningful fraction of its source code on figuring out how to split inputs into “partitions.”

Coordinating the work. Once you have a hundred workers, somebody has to tell each of them what to do, in what order, and what to do when one of them takes ten times longer than the others (the dreaded “straggler”). That somebody is a scheduler, and writing one that works at scale is a small career. Spark’s driver process is exactly this.

Shuffling. Most non-trivial computations — joins, group-bys, sorts — require data to be moved between machines so that all the records with the same key end up on the same worker. This is called a shuffle, and it is the single most expensive operation in a distributed engine. We will spend many lessons later in the course on how to avoid shuffles, minimise shuffles, and survive the shuffles you can’t avoid.

Combining results. After all those workers have done their piece, somebody has to assemble the final answer. Sometimes that’s trivial (concatenating partitions). Sometimes it’s not (a global sort across a hundred machines is a genuinely interesting algorithm).

Handling failure. At scale, failure is not an edge case; it’s a steady-state condition. With 1,000 workers running for an hour, some of them will fail — bad disks, network blips, cloud spot instances getting reclaimed, OOMs from a skewed key, you name it. The framework has to detect failures, re-run the lost work, and keep going without restarting the whole job. This is the single biggest reason you don’t write distributed jobs from scratch in Python. It’s also the reason a Spark job that “should take 10 minutes” sometimes takes 40: somewhere a worker died and three stages got retried.

A distributed compute engine is, in essence, a piece of software that hides all of the above behind an API that looks like you’re working with one big DataFrame. Spark’s whole pitch is that you write code as if you had a single machine with infinite RAM, and the engine quietly handles the splitting, scheduling, shuffling, combining, and recovering for you. When it works, it’s magic. When it doesn’t, you spend a week reading stage logs and the words “shuffle spill” enter your daily vocabulary.

What’s coming next

The next five lessons are still theory, no code yet — that starts in lesson 7 with our first SparkSession. Lesson 2 looks at MapReduce and Hadoop, the system that proved distributed compute could be made tractable for normal programmers, and the model Spark inherited and improved on. Lesson 3 introduces Spark itself: the 2010 Berkeley paper, the in-memory dataset abstraction, and what “100x faster than Hadoop” actually means once you read the fine print. Lesson 4 is the architecture: driver, executors, cluster manager, and how a job actually flows through them. Lesson 5 is the dataset abstractions — RDDs, DataFrames, Datasets — and which one you should actually use in 2026 (spoiler: DataFrames). Lesson 6 closes the theory module with PySpark vs Scala vs SQL: three ways to talk to the same engine, and when each one is the right choice.

Then in module 2 we install Spark, write our first job, and never look at the four V’s again.

For further reading, the canonical reference for everything in this course is the Apache Spark documentation. Bookmark it now; we’ll be back.