You can read forty pages about RDDs, DataFrames, lineage and the Catalyst optimiser, but if from pyspark.sql import SparkSession doesn’t run on your machine, none of it matters. Today we install PySpark, get a SparkSession running, and put a tiny DataFrame on screen. By the end of this lesson you can copy-paste the examples from every following lesson and have them work.
There are three reasonable ways to get PySpark running. We’ll do all three quickly so you can pick yours, then deal with the one piece of Windows-specific weirdness that catches everyone the first time: winutils.exe.
Three install paths, pick one
Path A: pip-install PySpark. One command, runs on Windows / macOS / Linux, gives you everything you need to write and run PySpark scripts. No master/worker daemons, no cluster: just the Python API and a bundled local Spark that PySpark launches in a background JVM next to your Python process. This is what most learners and many data engineers use day-to-day. Recommended for this course.
Path B — full Apache Spark distribution. Download the tarball from the Spark website, extract it, set SPARK_HOME. You get the same PySpark API plus all the shell scripts: spark-submit, spark-shell, start-master.sh, start-worker.sh. Useful if you want to simulate a real cluster on your laptop, or run JVM Spark jobs alongside Python. Overkill for now.
Path C — Databricks Community Edition. Free cloud notebooks running on a tiny managed Spark cluster. No install, no Java, no winutils. Great if your laptop is old or you don’t want to fight your environment. Sign up at community.cloud.databricks.com. The downside: you’re stuck in their notebook UI, can’t run local scripts, and the cluster shuts down after idle periods.
For the rest of this course I’ll assume Path A (pip install). If you go with C, just paste the code into a notebook cell and skip the install steps.
The Java requirement
Spark is a JVM project. PySpark is a Python wrapper that talks to a JVM in the background through a bridge called Py4J. You can’t escape needing Java on the machine.
Spark 3.5.x — the current stable line as of writing — supports Java 8, 11, and 17. Spark 4.x (preview) requires Java 17 or newer. Anything else, and you’ll either get a startup error or, worse, mysterious crashes deep into a job.
Check what you have:
java -version
You want output like openjdk version "17.0.10" or "11.0.22". If you see 'java' is not recognized (Windows) or command not found (Mac/Linux), you don’t have Java installed. If you see Java 21, downgrade — Spark 3.5 doesn’t officially support 21 yet and you’ll hit weird Hadoop-side issues.
The cleanest distribution to install is Eclipse Temurin (the community-built OpenJDK), from adoptium.net. Pick Temurin 17 LTS. Run the installer. On Windows, tick “Set JAVA_HOME variable” and check that “Add to PATH” is selected during install. The JAVA_HOME option is off by default, and forgetting it is the #1 cause of JAVA_HOME is not set errors.
After installing, open a fresh terminal (env vars don’t propagate to existing shells) and re-run java -version. Should work.
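If you would rather script this check than eyeball it, here is a minimal sketch. The helper names (parse_java_major, installed_java_major) are made up for this lesson, and the supported set simply repeats the Java 8/11/17 list quoted above for Spark 3.5.x.

```python
import re
import subprocess
from typing import Optional

# Java majors this lesson lists as supported by Spark 3.5.x.
SUPPORTED_JAVA_MAJORS = {8, 11, 17}


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and older report themselves as "1.8.0_392", "1.7.0_80", ...
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major() -> Optional[int]:
    """Run `java -version`; return the major version, or None if java is missing."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except OSError:
        return None
    # `java -version` writes its banner to stderr, not stdout.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None:
        print("Java not found on PATH")
    elif major in SUPPORTED_JAVA_MAJORS:
        print(f"Java {major}: OK for Spark 3.5.x")
    else:
        print(f"Java {major}: not in {sorted(SUPPORTED_JAVA_MAJORS)}")
```

The parsing is kept separate from the subprocess call so you can feed it a pasted version string from a colleague's machine.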
Verify JAVA_HOME is set:
# macOS / Linux
echo $JAVA_HOME
# Windows (PowerShell)
$env:JAVA_HOME
# Windows (cmd)
echo %JAVA_HOME%
It should point at something like C:\Program Files\Eclipse Adoptium\jdk-17.0.10.7-hotspot or /Library/Java/JavaVirtualMachines/temurin-17.jdk/Contents/Home.
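A set JAVA_HOME can still point at a stale or half-deleted JDK. The sketch below checks that the folder really contains a Java launcher. The function name java_home_problem is hypothetical, and the check assumes the usual JDK layout with a bin folder holding java or java.exe.

```python
import os
from pathlib import Path
from typing import Optional


def java_home_problem(java_home: Optional[str]) -> Optional[str]:
    """Return a description of what is wrong with JAVA_HOME, or None if it looks fine."""
    if not java_home:
        return "JAVA_HOME is not set"
    bin_dir = Path(java_home) / "bin"
    # The launcher is java.exe on Windows and java elsewhere.
    if not any((bin_dir / name).is_file() for name in ("java", "java.exe")):
        return f"no java launcher found in {bin_dir}"
    return None


if __name__ == "__main__":
    problem = java_home_problem(os.environ.get("JAVA_HOME"))
    print(problem or "JAVA_HOME looks fine")
```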
Python: 3.8 or newer
PySpark 3.5 supports Python 3.8 through 3.12. Python 3.13 isn’t officially supported by PySpark 3.5, and PyArrow (an optional dependency we install below) may not have stable wheels for it yet. If you’re on 3.13 you’ll likely fight install errors; downgrade to 3.12 or use pyenv/conda to manage versions.
python --version
# Python 3.11.7
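The same check from inside Python, as a small sketch. The bounds are just the 3.8 to 3.12 range quoted above for PySpark 3.5, and python_supported is a name invented for this example.

```python
import sys
from typing import Tuple

# Range this lesson gives for PySpark 3.5: Python 3.8 through 3.12.
MIN_PYTHON = (3, 8)
MAX_PYTHON = (3, 12)


def python_supported(version: Tuple[int, int]) -> bool:
    """True if (major, minor) falls inside the range quoted above."""
    return MIN_PYTHON <= version <= MAX_PYTHON


if __name__ == "__main__":
    current = sys.version_info[:2]
    verdict = "supported" if python_supported(current) else "outside the supported range"
    print(f"Python {current[0]}.{current[1]} is {verdict} for PySpark 3.5")
```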
Use a virtual environment. Always.
python -m venv .venv
# Activate it:
# macOS / Linux
source .venv/bin/activate
# Windows (PowerShell)
.venv\Scripts\Activate.ps1
# Windows (cmd)
.venv\Scripts\activate.bat
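To confirm the activation took, you can ask the interpreter itself. This is a minimal sketch that relies on how the standard venv module works: inside a venv, sys.prefix differs from sys.base_prefix. Conda environments don't follow this rule, so treat a negative answer there as inconclusive.

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active" if in_virtualenv() else "using the base interpreter")
    print(f"interpreter: {sys.executable}")
```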
Path A: pip install pyspark
With Java installed, your virtual environment activated, and Python 3.8–3.12 in place:
pip install pyspark==3.5.1
Pinning the version is a good habit. Without it you’ll get whatever the latest is, which today is 3.5.1 but could be 3.5.5 or 4.0.0 by next month. The lessons that follow assume 3.5.x.
The download is around 280 MB — most of it is the bundled Spark JAR files. Be patient.
While we’re here, install one more thing:
pip install pyarrow
PyArrow isn’t strictly required for plain DataFrame work, but the pandas API on Spark needs it, and with spark.sql.execution.arrow.pyspark.enabled set to true it dramatically speeds up the Pandas-to-Spark conversions and toPandas() calls we’ll use later.
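A quick way to see what pip actually installed, without starting Spark. This sketch uses only the standard library's importlib.metadata; the helper name installed_version is made up for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for package in ("pyspark", "pyarrow"):
        print(f"{package}: {installed_version(package) or 'not installed'}")
```

Run it with the virtual environment active; otherwise it reports on whichever interpreter happens to be first on your PATH.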
Path B (optional): full Spark distribution
If you want spark-submit and friends, do this in addition to Path A.
- Go to spark.apache.org/downloads.
- Pick Spark 3.5.1, package type “Pre-built for Apache Hadoop 3.3 and later.”
- Download the .tgz. Extract it somewhere stable. I keep mine at ~/spark on macOS and C:\spark on Windows.
- Set SPARK_HOME to that folder, and add $SPARK_HOME/bin to your PATH.
# macOS / Linux: add to ~/.zshrc or ~/.bashrc
export SPARK_HOME=$HOME/spark/spark-3.5.1-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH
# Windows: System Environment Variables → New → SPARK_HOME = C:\spark\spark-3.5.1-bin-hadoop3
# Then edit Path → Add %SPARK_HOME%\bin
Verify:
spark-submit --version
You should see a Spark logo in ASCII art. If you do, the distribution is wired up correctly. We won’t use it much in this course, but it’s there.
The Windows-only winutils dance
Now for the part everyone trips on.
Spark uses Hadoop’s filesystem code internally — even when you’re not actually using Hadoop. On Linux and macOS, Hadoop’s I/O layer just calls into the OS. On Windows, Hadoop wants a specific binary called winutils.exe plus a couple of native DLLs. They don’t ship with Hadoop or Spark. You have to download them separately.
If you skip this, the symptom is one of:
- java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries
- UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows
- A warning at startup that Spark ignores file permissions (less catastrophic, but ugly).
The fix:
- Pick a Hadoop version. PySpark 3.5 ships against Hadoop 3.3, so grab winutils for Hadoop 3.3.
- Download the binaries from the community-maintained repo at github.com/cdarlint/winutils. Specifically the hadoop-3.3.x/bin/ folder.
- Put them somewhere stable; I use C:\hadoop\bin\. Both winutils.exe and hadoop.dll should sit in that bin folder.
- Set HADOOP_HOME=C:\hadoop and add %HADOOP_HOME%\bin to your PATH.
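# PowerShell, persistent (user scope); "$env:Path +=" alone would only last for the current session
[Environment]::SetEnvironmentVariable("HADOOP_HOME", "C:\hadoop", "User")
$userPath = [Environment]::GetEnvironmentVariable("Path", "User")
[Environment]::SetEnvironmentVariable("Path", "$userPath;C:\hadoop\bin", "User")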
Open a fresh terminal. Verify:
winutils.exe ls C:\
If that prints a directory listing, you’re done. PySpark will find winutils.exe via HADOOP_HOME and stop complaining.
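If you want the same check from Python, here is a minimal sketch. The file list is just the two files named in the steps above, and missing_hadoop_files is a hypothetical helper, not anything PySpark provides.

```python
import os
from pathlib import Path
from typing import List, Optional

# Files this lesson places in the bin folder under HADOOP_HOME.
REQUIRED_FILES = ("winutils.exe", "hadoop.dll")


def missing_hadoop_files(hadoop_home: Optional[str]) -> List[str]:
    """List the required files that are absent from <hadoop_home>/bin."""
    if not hadoop_home:
        return list(REQUIRED_FILES)
    bin_dir = Path(hadoop_home) / "bin"
    return [name for name in REQUIRED_FILES if not (bin_dir / name).is_file()]


if __name__ == "__main__":
    if os.name != "nt":
        print("Not Windows: winutils is not needed")
    else:
        missing = missing_hadoop_files(os.environ.get("HADOOP_HOME"))
        if missing:
            print("missing: " + ", ".join(missing))
        else:
            print("winutils setup looks complete")
```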
Mac and Linux users: skip this whole section. Hadoop’s POSIX path works fine on your OS and you don’t need any of this.
The five-line sanity check
Save this as hello_spark.py:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("HelloSpark").getOrCreate()
df = spark.createDataFrame([(1, "Narcis"), (2, "Spark"), (3, "Italy")], ["id", "name"])
df.show()
spark.stop()
Run it:
python hello_spark.py
The first time you run this, expect a wall of INFO and WARN log lines. That’s normal — Spark is loud by default. Somewhere in the noise you’ll see:
+---+------+
| id| name|
+---+------+
| 1|Narcis|
| 2| Spark|
| 3| Italy|
+---+------+
If you see that table, your install is good. Move on to the next lesson.
Quieting the logs
The default log level is WARN, but on startup Spark prints a lot of INFO lines from Hadoop and the JVM. To get a cleaner output:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("HelloSpark")
         .getOrCreate())
spark.sparkContext.setLogLevel("WARN") # or "ERROR" for even quieter
df = spark.createDataFrame([(1, "Narcis"), (2, "Spark")], ["id", "name"])
df.show()
spark.stop()
This sets the level after the session is built. We’ll cover the cleaner approach (a log4j2.properties file in SPARK_HOME/conf/) when we get to operational topics in Module 8.
The first-run errors and how to fix them
These four hit roughly 80% of new installs. Memorise the symptom-fix mapping.
1. JAVA_HOME is not set or 'java' is not recognized.
Java isn’t installed, or it’s installed but JAVA_HOME and PATH aren’t pointing at it. Reinstall Temurin 17 with the “Set JAVA_HOME” checkbox ticked. Open a fresh terminal afterward.
2. Could not locate executable null\bin\winutils.exe.
Windows-only. You skipped the winutils section. Go back, install winutils, set HADOOP_HOME, open a new shell.
3. Java gateway process exited before sending its port number.
Spark started a JVM, the JVM crashed before it could phone home. Usually one of:
- Wrong Java version (you have 21, or something older than 8, when Spark 3.5 wants 8, 11 or 17). Switch.
- Antivirus or corporate VPN blocking the local socket. Try with VPN off.
- On Mac with Apple Silicon, you grabbed an x86 JDK. Get the ARM64 / aarch64 build of Temurin instead.
4. WARN ProcfsMetricsGetter: Exception when trying to compute pagesize.
Cosmetic warning, ignore. It appears when Spark can’t run getconf PAGESIZE, which is most common on Windows. No effect on jobs.
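If memorising isn't your thing, the symptom-fix mapping fits in a small lookup table. This is only a sketch of the four cases above: the substrings and the suggest_fix helper are made up for illustration and are not part of PySpark.

```python
from typing import Optional

JAVA_FIX = "Install Temurin 17 with 'Set JAVA_HOME' ticked, then open a fresh terminal."

# Substring of the error message -> fix, following the four cases listed above.
KNOWN_ERRORS = {
    "JAVA_HOME is not set": JAVA_FIX,
    "'java' is not recognized": JAVA_FIX,
    "winutils.exe": "Windows only: install winutils, set HADOOP_HOME, open a new shell.",
    "Java gateway process exited": "Check the Java version, VPN or antivirus, and the JDK architecture.",
    "ProcfsMetricsGetter": "Cosmetic warning; safe to ignore.",
}


def suggest_fix(message: str) -> Optional[str]:
    """Return the fix for the first known symptom found in the message, else None."""
    for needle, fix in KNOWN_ERRORS.items():
        if needle in message:
            return fix
    return None


if __name__ == "__main__":
    sample = "Java gateway process exited before sending its port number"
    print(suggest_fix(sample))
```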
If you hit something not on this list, copy the first exception in the stack trace (usually buried under five layers of Py4J wrapping) and search for it. Ninety per cent of the time someone on Stack Overflow has had the same issue since 2016.
A quick word on IDEs
The course doesn’t assume any particular editor, but for what it’s worth: VS Code with the Python and Jupyter extensions is a very pleasant PySpark environment. Open any folder that has your .venv in it, hit Ctrl+Shift+P → “Python: Select Interpreter,” pick the venv. From then on the integrated terminal activates the venv automatically and the editor knows where pyspark lives.
For interactive work, drop a Jupyter notebook in the same folder:
pip install jupyter
jupyter notebook
You can run the same SparkSession-building code in a notebook cell. The session stays alive across cells, so you build it once at the top and then keep poking at DataFrames in subsequent cells. We’ll use this pattern throughout the course.
If you prefer JetBrains, PyCharm Community works equally well. DataSpell is JetBrains’ notebook-flavoured IDE and has nicer DataFrame previews.
Verifying with a slightly more realistic test
The five-line script above proves Spark starts. This one proves it can do work. Save as verify_spark.py:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg
spark = (SparkSession.builder
         .appName("VerifySpark")
         .master("local[*]")
         .getOrCreate())
spark.sparkContext.setLogLevel("WARN")
# Generate 1 million rows synthetically — no external data needed
df = (spark.range(0, 1_000_000)
      .selectExpr("id",
                  "id % 7 AS day_of_week",
                  "(id * 13) % 100 AS amount"))
# Group, aggregate, sort
result = (df.groupBy("day_of_week")
          .agg(count("*").alias("rows"),
               avg("amount").alias("avg_amount"))
          .orderBy("day_of_week"))
result.show()
print(f"Spark version: {spark.version}")
print(f"Spark UI: {spark.sparkContext.uiWebUrl}")
spark.stop()
Run it:
python verify_spark.py
You should see a 7-row table with day_of_week from 0 to 6, about 142,857 rows in each bucket (bucket 0 gets one extra), and an average amount of roughly 49.5. Total runtime is around 5–10 seconds on a modern laptop. If that works, your install can handle real work: generating data, aggregating, sorting, the works.
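If you want to know where those numbers come from, the same aggregation can be mirrored in plain Python with no Spark involved. This is a sketch for cross-checking only; expected_result is a made-up helper, and it assumes at least 7 rows so that every bucket is non-empty.

```python
from typing import Dict, Tuple


def expected_result(rows: int) -> Dict[int, Tuple[int, float]]:
    """Mirror verify_spark.py in plain Python: day_of_week -> (row count, average amount)."""
    counts = [0] * 7
    totals = [0] * 7
    for i in range(rows):
        day = i % 7                      # same expression as "id % 7"
        counts[day] += 1
        totals[day] += (i * 13) % 100    # same expression as "(id * 13) % 100"
    # Assumes rows >= 7, otherwise some bucket is empty and the division fails.
    return {day: (counts[day], totals[day] / counts[day]) for day in range(7)}


if __name__ == "__main__":
    for day, (count, average) in expected_result(1_000_000).items():
        print(f"{day}  rows={count}  avg_amount={average:.3f}")
```

One million isn't divisible by 7, so the counts can't all be equal: bucket 0 ends up with 142,858 rows and the other six with 142,857.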
Where we are now
You have:
- Java 11 or 17 installed and on the PATH.
- Python 3.8–3.12 in a virtualenv.
- pyspark==3.5.1 and pyarrow installed.
- (Windows only) winutils.exe at %HADOOP_HOME%\bin.
- A working hello_spark.py that prints a tiny DataFrame.
That’s the whole development environment for the rest of the course. Next lesson, we open up SparkSession.builder and look at every important configuration knob — what local[*] actually does, why spark.sql.shuffle.partitions defaults to 200 even when you have 8 cores, and how to read the Spark UI on localhost:4040.