Welcome to Module 5. Module 4 left us with a set of well-replicated, well-partitioned, well-sharded databases, and a Discord case study that shows what those abstractions look like at scale. Storage is the floor. Now we have to talk about what flows through it.
Module 5 covers batch processing: the family of jobs that read a finite (and usually large) input, do work on it, and write a finite output, with no expectation of completing in milliseconds. The job runs, finishes, and stops. Hourly aggregates, daily reports, machine-learning training data, the columns the BI dashboard reads at 7am: all of it is batch. The dual of batch is stream processing, which Module 6 covers. For now, the data is at rest, and the job has time to think.
We start with the question that organises every batch pipeline ever written: where does the transformation step live? In 1995 the answer was different from the one in 2026, and the shift between the two answers is the cleanest way to see why the modern data stack looks the way it does.
The two letter orders
A batch pipeline that moves data from a source system into an analytics store has three steps. You extract records from the source (a transactional database, a SaaS API, a CSV that lands in an FTP folder, a log stream archived to disk). You transform them: clean nulls, deduplicate, parse dates, join with reference data, compute derived columns, conform to a target schema. And you load the result into the analytics destination, which is almost always a data warehouse.
The two letter orders that name pipelines are about where the transform happens.
ETL stands for extract, transform, load. The transform step runs in a separate compute layer between the source and the warehouse: a dedicated ETL server, a Spark cluster, a fleet of Python workers. Data leaves the source, gets reshaped in flight, and arrives at the warehouse already in its analytical form. The warehouse stores only the cleaned, conformed, query-ready output.
ELT stands for extract, load, transform. The transform step runs inside the warehouse, on data that has already landed there in something close to its raw shape. Data leaves the source, gets dumped into the warehouse with minimal processing, and the transform happens as a sequence of SQL statements that read raw tables and write derived ones.
The pipelines look almost identical from a distance. The same three boxes, the same three arrows. The position of one box is the entire story of why the data stack rebuilt itself between 2015 and 2025.
flowchart LR
subgraph ETL[ETL]
S1[(Source)] --> T1[Transform layer<br/>Informatica, Talend, Spark]
T1 --> W1[(Warehouse<br/>cleaned only)]
end
subgraph ELT[ELT]
S2[(Source)] --> W2[(Warehouse<br/>raw + derived)]
W2 --> T2[Transform in-warehouse<br/>dbt, SQL]
T2 --> W2
end
The ETL era
Through the 1990s and 2000s, ETL was not a choice. It was the only architecture that made economic sense, and the reason was the warehouse.
The dominant warehouses of that era were Teradata, Oracle, and SQL Server’s data-warehouse editions, with IBM Db2 in some shops. They were sold as appliances or as expensive licensed software, sized for a fixed workload, with storage and compute coupled in a single box. Adding capacity meant buying a bigger appliance, which meant a procurement cycle measured in quarters. Warehouse compute was the scarcest resource in the building, and it was reserved for the queries the business actually ran.
Loading raw data into that warehouse and then running cleanup queries against it would have been an act of vandalism. The cleanup queries would have competed with the analyst queries for the same expensive cores, the raw data would have eaten the same expensive disks the business expected to hold dimensional models, and the appliance would have been sized to the wrong workload within a year. So you cleaned the data somewhere else.
That somewhere else was the ETL tier. Informatica PowerCenter (still the largest ETL vendor by revenue when it went private again in 2023), IBM DataStage, Microsoft SSIS, and Talend defined the category. Underneath the GUI, an ETL job was a graph of operators, each reading rows from upstream, doing some work, and pushing rows downstream, with the final operator writing into the warehouse. Many shops bypassed the vendors entirely and wrote ETL jobs as cron-driven Python or shell scripts that pulled from sources, transformed in pandas or PL/SQL, and inserted the result.
The architectural property of the ETL era is that the warehouse only ever sees clean data. Raw rows from the source system never land in it. If a transformation rule was wrong, you fixed the rule in the ETL layer and reran the job, which meant re-extracting from the source. There was no “raw” layer to fall back to.
What changed
Cloud warehouses changed the economics, and the order of operations followed.
Snowflake (general availability 2014), BigQuery (public 2012), and Redshift (2013) moved warehouses to a separation-of-storage-and-compute model running on object storage and on-demand compute. Storage became cheap (cents per gigabyte per month, on commodity object stores under the hood). Compute became elastic (spin up a warehouse for a query, spin it down when idle). Adding capacity stopped being a procurement cycle and started being a setting in a config file. Databricks SQL (2021) brought the same model to the lakehouse side of the world.
Once warehouse compute was cheap and elastic, the argument for keeping raw data out of it evaporated. You could land everything, store it forever for pennies, and run the cleanup queries on the same engine that ran the BI queries. The transformation stopped being an external pre-processing step and became a sequence of SQL statements that lived next to the rest of the analytical code.
The tooling reorganised around the new shape. The modern ELT stack has three layers.
Extract and load. Fivetran, Airbyte, Stitch, and Meltano dominate the managed connector business. They handle the ugly part of integration: the SaaS API quirks, the schema drift, the incremental-pull logic, the credential rotation. The output is raw tables in the warehouse that look as much like the source as is reasonable. Many teams supplement managed connectors with custom Python ingestion for the long tail of internal systems, which lands in the same raw layer.
Warehouse. Snowflake, BigQuery, Redshift, Databricks SQL. The same engine stores the raw data, runs the transformations, and serves the BI tool. There is no separate compute tier for the transforms.
Transform. dbt is the dominant tool of the modern data stack in 2026. A dbt project is a directory of SQL files. Each file is a select statement that defines a model. dbt compiles the project into a dependency graph and runs the models in order, materialising each as a table or a view in the warehouse. Versioned in git, tested with assertions, documented automatically, executed by CI. The transformation logic stops being a black box inside an ETL tool and becomes a code repository the data team owns the same way the application team owns its services.
Why ELT wins for most cases
A few reasons compound.
The raw layer is preserved. Once data lands in the warehouse in something close to its source shape, you can re-derive anything downstream. A new metric definition does not require a re-extract; you write a new SQL model. A bug in a historical transformation can be fixed and the affected models rebuilt without going back to the source. The raw layer is the single source of truth, and the source system is no longer in the loop for analytical questions.
The transforms live in SQL. The people who know the business questions are usually analysts, and analysts know SQL. ELT puts the transformation code in the language the people closest to the data already speak, instead of routing it through a separate ETL tool that requires a separate skill set.
The same warehouse serves the dashboard and the transform. Operationally, you have one system to scale, one system to monitor, one system to back up. The transformations and the BI queries share a workload manager and an access-control model. Lineage tools (dbt’s docs, plus newer entrants like Atlan, OpenMetadata, and Datafold) can trace a column on a dashboard back through every model that produced it, because all of it is queries against the same engine.
Cost is mostly storage, which is cheap. The economic intuition flipped: keeping the raw data is almost free, and the transformations only run when something needs to be refreshed. Compute is paid for by the second on the queries that actually run, not by a static appliance that sits idle three nights a week.
When ETL is still right
ELT is the default in 2026, but the cases where the transform belongs upstream are real.
The source data is sensitive. If the input contains personal data that the warehouse is not allowed to hold (specific health-record fields, raw payment numbers, certain identifiers under GDPR, HIPAA, or PCI scopes), you cannot land the raw data and then clean it up. The data has to be filtered, anonymised, or tokenised before it leaves the source environment. That is an ETL pipeline by necessity, even if the rest of the stack is ELT.
The volume is so large that warehouse compute is too expensive. Cloud warehouses are cheap per query, but petabyte-scale ingest with heavy per-row processing can still cost more in warehouse credits than running the same transform on a Spark cluster on raw object storage. The lakehouse pattern (Module 5 lesson 37) addresses this by moving the transforms onto cheap compute against open table formats, but the principle is the same: when warehouse compute is the bottleneck, do the heavy work elsewhere.
The transform requires something the warehouse cannot do. Image processing, audio transcription, ML inference, calls to external APIs to enrich rows: warehouse SQL engines do not run these well, if at all. The transformation lives in a Spark or Python tier upstream, and only the result lands in the warehouse.
The source cannot afford to send raw data. An IoT fleet generating terabytes of sensor readings per hour usually cannot afford the bandwidth to ship raw to the cloud. Edge aggregation reduces the volume before it leaves the device, which is ETL by another name.
The hybrid is the common reality. The bulk of the pipelines are ELT into the warehouse; the special cases (PII filtering, ML enrichment, edge aggregation) are ETL-flavoured pre-processing that lands a slightly cooked version of the raw layer.
What this leaves us with
The shift from ETL to ELT was not a fashion change. It tracked a real change in the cost curves of warehouse storage and compute, and it pulled the transformation step from a closed tool into a SQL repository the data team can review, test, and version like any other code.
The architectural floor underneath modern ELT is the warehouse-or-lakehouse engine that runs the transformation queries. That engine has its own history. The descendants of MapReduce, the open-source distributed batch processors that the cloud warehouses copied parts of internally, are the subject of lesson 34. Before Snowflake made batch invisible behind a SQL prompt, somebody had to invent the idea that a thousand cheap machines could replace one expensive one. That somebody was Google, and the open-source community spent the next decade catching up.
Citations and further reading
- dbt Labs, “What is dbt?”,
https://docs.getdbt.com/docs/introduction(retrieved 2026-05-01). The reference for the modern in-warehouse transformation pattern. - Snowflake, “Key Concepts and Architecture”,
https://docs.snowflake.com/en/user-guide/intro-key-concepts(retrieved 2026-05-01). The separation of storage and compute that made ELT economical at scale. - Google Cloud, “BigQuery overview”,
https://cloud.google.com/bigquery/docs/introduction(retrieved 2026-05-01). - Fivetran, “ETL vs. ELT”,
https://www.fivetran.com/learn/etl-vs-elt(retrieved 2026-05-01). Vendor-perspective summary, useful for the historical timeline. - “Fundamentals of Data Engineering” (Joe Reis and Matt Housley, O’Reilly, 2022), chapters on ingestion and transformation. The standard contemporary reference for the modern data stack.