Python, from the ground up Lesson 37 / 60

ETL design: extract, transform, load

The pattern that defines half of all data engineering work, and the modern variations (ELT, medallion, lakehouse) you'll meet.

Welcome to Module 7. We’ve spent the last twelve lessons learning to analyze data — load a file, slice it, group it, plot it, write up the answer. That’s the consumer side of the data world. The producer side — the engineering that puts a clean dataset on the analyst’s disk in the first place — is what the next several lessons are about. It’s also a very large fraction of the actual paying jobs that have “Python” in the title in 2026.

The single most common shape that work takes is ETL: extract, transform, load. Three letters that define a pipeline. You read data from somewhere, change its shape, and write it somewhere else. Almost every batch data job in the industry is some variant of this. Today we walk through the pattern, the modern variations (ELT, medallion, lakehouse — the buzzwords your manager will use), and the design decisions you make every time you sit down to build one.

The three stages, decoupled

The reason ETL is a useful mental model and not just a description is that the three stages have different failure modes and benefit from being kept separate.

Extract is where you talk to the outside world. APIs go down, databases get slow, network blips happen, files arrive late or half-written. Anything that fails because reality is messy fails here.

Transform is pure computation. Given the same input it always produces the same output. No external dependencies, no clock, no network. If a transform fails, it’s a bug in your code, not a bad day on someone else’s server.

Load is the other side of the messy boundary — writing to a destination that has its own rules about uniqueness, ordering, transactions, and rate limits.

The discipline of keeping these decoupled pays back the first time something goes wrong. If extract and transform are smashed together and the API returns garbage, you can’t replay the transform on yesterday’s data without hitting the API again. If they’re decoupled — extract dumps raw to disk, transform reads from disk — you can re-run transforms a hundred times without ever touching the source. This is the single biggest design lever in pipeline work and most beginners ignore it.

def run_pipeline(date: str) -> None:
    raw_path = extract(date)        # writes raw/2026-04-01.json
    clean_path = transform(raw_path) # writes clean/2026-04-01.parquet
    load(clean_path)                 # upserts to warehouse

Three functions. Each one is independently testable, replayable, and skippable. That four-line scaffold is the bones of every pipeline I’ve ever shipped.

Extract: dealing with the source

Sources come in roughly four flavors and each has its own gotchas.

Databases (Postgres, MySQL, your company’s OLTP). Authentication is straightforward; the trap is load on the source. A naive SELECT * FROM orders against a busy production DB will get you a friendly Slack message. Use a read replica if available, prefer incremental queries with a watermark, and avoid running during business hours when you can.

APIs (REST, GraphQL, SOAP if you’re unlucky). Rate limits are the constant battle — see lesson 39 for the full retry-and-backoff toolkit. Pagination is always a thing and there are about four flavors. Authentication ranges from “API key in header” to “OAuth dance with refresh tokens.” Most APIs lie about something in their docs.

Files (CSVs in S3, Parquet in a data lake, vendor drops to FTP). Easy to extract, hard to know when they’re “ready” — see lesson 38 for the lockfile pattern. Schema drift is the silent killer: a new column appears, a date format changes, and your pipeline keeps running but produces nonsense.

Streams (Kafka, Kinesis, change-data-capture). A different beast — continuous rather than batch, with their own primitives (offsets, consumer groups, replays). Out of scope for this module, but worth knowing they exist.

The key extract decision is incremental vs. full. Full is simpler: every run, pull everything. Works fine until “everything” is a billion rows. Incremental is harder: you track a watermark — a timestamp, an ID, a Kafka offset — and pull only what’s changed since last time. The price you pay is correctness pressure: if your watermark logic has a bug, you silently miss data, and the bug can hide for months.

A safe pattern is incremental with periodic full refreshes. Daily incrementals, weekly full reload. The full reload catches anything the incrementals dropped, and the incrementals keep the daily cost low.

Transform: the pure-function ideal

The transform stage is where Python (or SQL — more on that in a minute) earns its keep. The work is some mix of:

  • Cleaning: drop garbage rows, fix types, normalize strings, handle nulls.
  • Normalizing: split denormalized blobs into proper relational tables, or vice versa.
  • Joining: enrich one source with another (orders + customers, events + sessions).
  • Aggregating: roll up to whatever grain the consumers want.

The discipline that separates good transforms from bad: make them pure functions. Same input → same output. No reading from the network mid-transform. No datetime.now() deciding which branch to take. No mutable global state.

Why? Because pure transforms are replayable. If a downstream report is wrong, you can rerun yesterday’s transform on yesterday’s raw data and get exactly the same wrong answer, then fix the code, rerun, and get the right one. If the transform reads from the API mid-run, you can’t — the API has moved on.

from pathlib import Path
import polars as pl

def transform(raw_path: Path) -> Path:
    df = pl.read_json(raw_path)
    df = (
        df.filter(pl.col("status") != "deleted")
          .with_columns(pl.col("amount").cast(pl.Float64))
          .with_columns(pl.col("created_at").str.to_datetime())
          .drop_nulls(subset=["customer_id"])
    )
    out = raw_path.parent.parent / "clean" / raw_path.with_suffix(".parquet").name
    out.parent.mkdir(parents=True, exist_ok=True)
    df.write_parquet(out)
    return out

That function takes a path and returns a path. No side effects beyond writing the output. You can call it a thousand times and it does the same thing. That’s the property to protect.

Load: write modes and idempotency

Three write modes you’ll meet:

  • Append: tack new rows onto the end. Simple, but reruns produce duplicates.
  • Overwrite: blow away the destination and write fresh. Safe to rerun, but expensive at scale, and breaks downstream queries while it’s running.
  • Upsert (insert-or-update on a key): the gold standard for idempotency. Rerun the same data, get the same result.

Idempotency is the magic word. A pipeline is idempotent when running it twice with the same input produces the same final state. Without it, every retry is a roll of the dice; with it, retries are free. The path to idempotency is almost always “use a key, upsert on it.” In Postgres that’s INSERT ... ON CONFLICT DO UPDATE. In Snowflake it’s MERGE. In a Delta or Iceberg table it’s MERGE INTO. We’ll use the Postgres flavor in lesson 38.

ELT: the modern reordering

For the last decade, the warehouses got a lot bigger and a lot cheaper. Snowflake, BigQuery, Databricks, DuckDB at the small end — all let you store enormous amounts of raw data and run SQL transforms on it cheaply. So the industry quietly swapped two letters.

ELT = Extract, Load, Transform. You dump the raw data into the warehouse first, then transform with SQL inside the warehouse. The transformation engine is the warehouse, not your Python script.

The win: the warehouse is built for this. Joins across hundreds of millions of rows are fast. Transforms are versioned in a tool like dbt, run on a schedule, and produce documentation for free. Python becomes a thinner layer that mostly just does extract-and-load.

When to pick which:

  • ELT when your destination is a real warehouse and your transforms are mostly SQL-shaped (filters, joins, aggregates). This is the default for analytics work in 2026.
  • ETL when transforms need real Python — calling ML models, parsing weird formats, applying complex business logic that doesn’t translate to SQL. Or when the destination isn’t a warehouse (an API, a search index, a downstream service).

In practice you’ll do both. The pure-Python ETL job and the dbt-style ELT model live happily in the same company.

Medallion architecture

The buzzword Databricks coined and the rest of the industry adopted: bronze, silver, gold layers.

  • Bronze: raw, exactly as ingested. Schema-on-read, full fidelity, append-only, kept forever (or as long as compliance lets you). If something downstream is wrong, the bronze layer is the source of truth you replay from.
  • Silver: cleaned and conformed. Standard column names, types fixed, deduped, joined to reference data. Still row-level — one customer-event per row.
  • Gold: analytics-ready. Aggregated, denormalized, organized around business questions. The tables a BI dashboard reads.

You don’t need Databricks to use this. The pattern works just as well with three Postgres schemas, or three S3 prefixes, or three subdirectories on a server. The point is the layering: each transform reads from one layer and writes to the next, and each layer has a clear contract.

Lakehouse and table formats

The third bit of vocabulary is the lakehouse: a data lake (cheap object storage like S3) with ACID transactions bolted on via an open table format. Three formats compete: Delta Lake (Databricks-led), Apache Iceberg (Netflix-led, now the industry-favorite open standard), and Apache Hudi (Uber-led).

What they give you: a Parquet-files-on-S3 layout that supports transactional updates, deletes, time travel (“show me this table as of last Tuesday”), and schema evolution. You get warehouse-like semantics on lake-cheap storage. In 2026 Iceberg has the strongest momentum — Snowflake, BigQuery, DuckDB, and Trino all read it natively.

You don’t need to use any of this on day one. You should know it exists so when an architect says “let’s land bronze in Iceberg” you nod knowingly.

The decisions you make every pipeline

Every time you sit down to build one, the same handful of choices come up:

  • Python or SQL? Use SQL if the warehouse is the destination and the logic is set-shaped. Python for everything else.
  • Scheduled or event-triggered? Cron-style for predictable batches; event-triggered (file-arrival, webhook, queue message) for low-latency or irregular sources.
  • Read once or incrementally? Full is simpler; incremental scales. Watermark carefully.
  • What’s the recovery story? If this fails halfway, what does the rerun look like? If you can’t answer that, you don’t have a pipeline yet — you have a script that worked once.

The pieces of a real pipeline

A production pipeline isn’t just three functions; it’s an ecosystem.

  • Orchestrator — the thing that runs the jobs in the right order on the right schedule. Cron for v1, Airflow / Prefect / Dagster when dependencies grow. Lesson 41 covers this.
  • Transformations — your Python and/or SQL.
  • Sinks — destinations. Often more than one (warehouse + cache + search index).
  • Monitoring — is the job running? How long? How much data?
  • Alerting — page someone when it breaks at 3am.
  • Lineage / catalog — what feeds what, who owns it, where did this column come from?

Most teams start with orchestrator + transformations and bolt the rest on as the blast radius of failures grows. There’s no shame in that. There is shame in not bolting them on once they’re needed.

Where we go from here

In lesson 38 we build a real ingestion pipeline — files arrive in a folder, a Python script picks them up, validates, loads to Postgres, with all the patterns that survive contact with reality. In lesson 39 we move to the API-extraction side: retries, rate limits, pagination. By lesson 41 we’ve got an orchestrator wrapped around the lot, and you’ve built something that wouldn’t be out of place in a real shop.

For now, the mental model: extract from the messy outside, transform in pure Python or SQL, load idempotently to the destination, layer your tables, and design every step so reruns are free. That’s ETL. The rest is details.

Citations

Search