CI for data pipelines: testing without burning a cluster

Continuous integration for a web service is a well-trodden path. Every push, the test suite runs against the new code, the suite finishes in a few minutes, and the green light tells you the change is safe to merge. The shape of the test pyramid is mostly settled: lots of unit tests, fewer integration tests, a thin layer of end-to-end tests, and the whole thing fits comfortably inside a CI runner with a few gigabytes of RAM.

CI for data pipelines does not fit comfortably anywhere. The thing you want to test runs over terabytes. The cluster that runs it costs real money per minute. The bug you are trying to catch often only shows up when the input has the messy distribution of real production data, which you cannot copy into a CI runner. Run the actual pipeline on every pull request and the cloud bill becomes a problem in its own right; do not run it at all and you ship bugs to prod.

The compromise the industry has settled on is layered, the same idea as the test pyramid for software but with different layer thicknesses and different tooling. This lesson walks through the layers, the tools that fit each one, and the patterns that make pipeline CI both fast and trustworthy.

Why CI for data is different

Three properties of data work pull CI in directions web-service CI does not have to handle.

Pipelines are stateful by definition. The output of one run is the input of the next. A unit test on a single transformation is informative but not sufficient; you also want to know that the transformation composes correctly with everything around it. Web-service tests rarely have to think about state continuity, because each request is meant to be independent.

The blast radius of a bug is bigger and quieter. A bug in a service produces 5xx errors that monitoring catches in minutes. A bug in a pipeline writes wrong rows to a table that downstream jobs and dashboards trust. By the time someone notices the user count is off by a factor of two, the wrong data is in a dozen derived tables. The case for catching pipeline bugs before merge is stronger, not weaker, than for web services.

Real test data is hard. You cannot ship production data to CI without running into privacy and regulatory problems. You cannot generate synthetic data that matches the distribution of the real thing without effort. The default of “test against a small canned sample” is correct, but you have to be honest that the sample misses whole categories of bugs that only the real distribution exposes.

These three properties shape every choice below.

The CI pyramid for pipelines

The web-service test pyramid has three layers: unit tests at the bottom, integration tests in the middle, end-to-end tests at the top. The same shape works for pipelines, with the layer contents redefined.

Unit tests are the bottom layer. They run on small in-memory data, in milliseconds, in the language the pipeline is written in. Pure-function transformations are unit-testable: pass in a fixture, assert on the output. A function that takes a DataFrame of trip rows and returns a DataFrame of cleaned trip rows can be tested with a five-row input. PySpark code, dbt macros, Flink user-defined functions, Pandas transformations: all of them have unit-testable cores if the code is structured around pure functions instead of a single 800-line job script.

The discipline that makes unit tests possible is separating the business logic from the I/O. A job that reads from S3, transforms, writes to a warehouse, and orchestrates retries is hard to unit-test. The same job, refactored so the transformation is a function that takes a DataFrame and returns a DataFrame and the I/O is in a thin wrapper, is easy to unit-test. Most “data code is hard to test” complaints reduce to “this job has its transformations entangled with its I/O”.
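As a minimal sketch of that split, the example below keeps the transformation pure and pushes reading and writing into a thin wrapper. Rows are plain dicts rather than a DataFrame so the sketch has no dependencies; the names clean_trips and run_job and the columns trip_id, city, and distance_km are hypothetical, chosen only for illustration.

```python
import csv
import io


def clean_trips(rows):
    """Pure transformation: no file, network, or warehouse access."""
    cleaned = []
    for row in rows:
        # Drop rows whose distance is missing, unparseable, or non-positive.
        try:
            distance = float(row["distance_km"])
        except (KeyError, ValueError):
            continue
        if distance <= 0:
            continue
        cleaned.append(
            {
                "trip_id": row["trip_id"],
                "city": row["city"].strip().lower(),
                "distance_km": distance,
            }
        )
    return cleaned


def run_job(source, sink):
    """Thin I/O wrapper: read, call the pure function, write."""
    rows = clean_trips(csv.DictReader(source))
    writer = csv.DictWriter(sink, fieldnames=["trip_id", "city", "distance_km"])
    writer.writeheader()
    writer.writerows(rows)


def test_clean_trips_drops_bad_distances():
    rows = [
        {"trip_id": "1", "city": " Berlin ", "distance_km": "12.5"},
        {"trip_id": "2", "city": "Paris", "distance_km": "-3"},
        {"trip_id": "3", "city": "Rome", "distance_km": "n/a"},
    ]
    assert clean_trips(rows) == [
        {"trip_id": "1", "city": "berlin", "distance_km": 12.5}
    ]
```

The unit test never touches a file or a warehouse. In a real job the wrapper would take an S3 reader and a warehouse writer instead of in-memory streams, and the function under test would stay the same.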

Integration tests are the middle layer. They run the whole pipeline, stage by stage, on a sample dataset of a few thousand rows, in CI. The point is to catch bugs that unit tests miss: schema mismatches between stages, accidentally dropped columns, wrong join keys, broken contracts between transformations. The sample is small enough to fit in a CI runner and, if it is curated with care, large enough to exercise the main code paths.

Sample fixtures live in the repo. Canonical inputs in CSV or Parquet files, checked in, treated as part of the test suite. When the pipeline changes shape, the fixtures get updated in the same pull request as the code that needs them. Reviewers see both at once.
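One way to picture the integration layer is a test that pushes a tiny canonical input through every stage and checks the contract between them. The sketch below is an assumption-heavy illustration: the two stages, the REQUIRED_BY_AGGREGATE contract, and the inline sample (which would normally be a checked-in CSV or Parquet file) are all hypothetical.

```python
# Columns the aggregate stage expects from the clean stage (hypothetical contract).
REQUIRED_BY_AGGREGATE = {"city", "distance_km"}


def stage_clean(rows):
    """Stage 1: normalise city names, parse distances, drop empty distances."""
    return [
        {
            "trip_id": r["trip_id"],
            "city": r["city"].lower(),
            "distance_km": float(r["distance_km"]),
        }
        for r in rows
        if r["distance_km"]
    ]


def stage_aggregate(rows):
    """Stage 2: total distance per city."""
    totals = {}
    for r in rows:
        totals[r["city"]] = totals.get(r["city"], 0.0) + r["distance_km"]
    return totals


def check_contract(rows, required):
    """Fail loudly if an upstream change dropped a column a later stage needs."""
    for r in rows:
        missing = required - set(r)
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")


def run_pipeline(rows):
    cleaned = stage_clean(rows)
    check_contract(cleaned, REQUIRED_BY_AGGREGATE)
    return stage_aggregate(cleaned)


def test_pipeline_on_sample():
    sample = [
        {"trip_id": "1", "city": "Berlin", "distance_km": "10.0"},
        {"trip_id": "2", "city": "BERLIN", "distance_km": "5.0"},
        {"trip_id": "3", "city": "Paris", "distance_km": "7.5"},
        {"trip_id": "4", "city": "Paris", "distance_km": ""},
    ]
    assert run_pipeline(sample) == {"berlin": 15.0, "paris": 7.5}
```

A unit test on stage_aggregate alone would pass even if stage_clean stopped emitting city. The composed run is what catches that.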

End-to-end tests are the top layer. They run on a staging cluster against realistic-but-anonymized data. They are slow and expensive, so they run nightly or weekly, not on every pull request. They catch the things the small sample cannot: data-skew bugs, performance regressions on the real volume, orchestrator interactions, retry behaviour under partial failure.

The pyramid shape matters. Most teams that get CI for data wrong do it by inverting the pyramid: lots of expensive end-to-end runs, few unit tests, no integration layer in between. The cloud bill goes up and the feedback loop slows down, and the bugs the team actually has are unit-test-shaped bugs that the slow end-to-end tests catch only by accident.

dbt: declarative tests as a first stop

For SQL-shaped pipelines, the dbt project conventions are the first place to look. dbt ships with a set of declarative tests that you attach to columns or models in YAML.

The four built-in tests cover most of the bugs you will hit:

not_null: this column should never have null values.
unique: this column should have no duplicates.
accepted_values: this column should only contain values from a known set.
relationships: this foreign key should exist in the referenced table.

Custom tests are arbitrary SQL queries that should return zero rows when the data is healthy. “No row should have a negative amount.” “No customer should have more open subscriptions than allowed.” “Yesterday’s totals should be within five percent of the seven-day average.” Each is a query that selects the offending rows; any row it returns is a failed test.
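The zero-rows convention is easy to see outside dbt. The sketch below is not dbt itself: it uses Python's built-in sqlite3 module and hand-written queries to imitate what not_null, unique, relationships, and one custom check look for. The table names, column names, and check names are hypothetical.

```python
import sqlite3

# Each check selects the offending rows; an empty result means the check passes.
CHECKS = {
    "not_null_orders_customer_id":
        "SELECT order_id FROM orders WHERE customer_id IS NULL",
    "unique_orders_order_id":
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1",
    "relationships_orders_customers":
        """SELECT o.order_id
           FROM orders o
           LEFT JOIN customers c ON o.customer_id = c.customer_id
           WHERE o.customer_id IS NOT NULL AND c.customer_id IS NULL""",
    "no_negative_amount":
        "SELECT order_id FROM orders WHERE amount < 0",
}


def run_checks(conn):
    """Return {check_name: failing_row_count} for every check that returned rows."""
    failures = {}
    for name, sql in CHECKS.items():
        rows = conn.execute(sql).fetchall()
        if rows:
            failures[name] = len(rows)
    return failures


def build_demo_db():
    """Small in-memory database with two deliberately bad rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER);
        CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1), (2);
        INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, -5.0), (12, 3, 8.0);
        """
    )
    return conn
```

On the demo data, run_checks reports one failing row for the relationships check (order 12 points at a customer that does not exist) and one for the negative amount (order 11). A CI step would go red whenever the returned dict is non-empty.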

In CI, dbt tests run against a smaller dev warehouse: a separate BigQuery dataset, a separate Snowflake schema, a separate Postgres database. The pull request triggers a job that runs dbt build against this isolated target, which builds the models and runs the tests attached to them in dependency order, and reports back. (A job that builds with dbt run instead needs a separate dbt test step afterwards.) If anything fails, the pull request goes red.

Great Expectations is a more flexible cousin for non-dbt setups. Module 8 covers data quality in more depth; for now, treat dbt tests and Great Expectations as the declarative layer, and pytest as the imperative layer for code-shaped logic that does not fit cleanly into a SQL assertion.

pytest for the Python side

For Python-based pipelines (PySpark, Pandas, custom transformation libraries), pytest is the standard. The patterns are the same as for any Python codebase, with two extras worth calling out.

Fixture data lives in the repo. A tests/fixtures/ directory with small CSV or Parquet files. A pytest fixture loads them into a DataFrame at the start of each test. The fixtures are part of the code review: if you change the schema of the input, you change the fixture, and reviewers see both.

For PySpark, a SparkSession configured for local mode at the top of the test module gives you a real Spark engine in CI. It is slower than pure Pandas but exercises the actual code path. The pattern most teams settle on is: test the transformation logic with Pandas-equivalent fixtures for speed, then have a smaller integration suite that confirms the same logic works on Spark with a tiny session.

The fresh-warehouse pattern

The integration layer often needs more than just a dev schema. The clean version is: every pull request spins up a temporary, fully isolated warehouse, runs the pipeline against it, validates the output, and tears it down.

For BigQuery, this means a fresh dataset named after the pull request, dropped on merge or close. For Snowflake, a fresh schema cloned from a small reference schema. For Postgres, a fresh database in a CI Postgres instance.

The benefit is isolation: two pull requests cannot interfere with each other, and the test environment is clean every time. The cost is the orchestration: setup and teardown have to be reliable, and the cleanup has to handle pull requests that are abandoned without merging.
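The setup and teardown discipline can be sketched with a context manager. This is an illustration under stated assumptions, not a warehouse client: a SQLite file stands in for the per-pull-request dataset or schema, and ephemeral_warehouse, run_pipeline, validate, and the ci_pr_ naming scheme are all hypothetical.

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager


@contextmanager
def ephemeral_warehouse(pr_number, root=None):
    """Create an isolated database for one pull request and always remove it."""
    root = root or tempfile.gettempdir()
    path = os.path.join(root, f"ci_pr_{pr_number}.db")
    # A previous run for the same pull request may have crashed before cleanup.
    if os.path.exists(path):
        os.remove(path)
    conn = sqlite3.connect(path)
    try:
        yield conn
    finally:
        # Teardown runs whether the pipeline passed, failed, or raised.
        conn.close()
        if os.path.exists(path):
            os.remove(path)


def run_pipeline(conn):
    """Stand-in for the real pipeline: write a small output table."""
    conn.execute("CREATE TABLE trips_clean (trip_id INTEGER, distance_km REAL)")
    conn.executemany(
        "INSERT INTO trips_clean VALUES (?, ?)", [(1, 12.5), (2, 4.0)]
    )
    conn.commit()


def validate(conn):
    """Output check: no non-positive distances should reach the clean table."""
    (bad,) = conn.execute(
        "SELECT COUNT(*) FROM trips_clean WHERE distance_km <= 0"
    ).fetchone()
    return bad == 0
```

A CI job would wrap the run as `with ephemeral_warehouse(pr_number) as conn:` and call run_pipeline and validate inside the block. The try/finally covers failed runs, but not pull requests whose CI job never starts again, so a scheduled sweep for stale ci_pr_ databases is still worth having.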

The pattern earns its keep on teams where multiple pipeline changes are in flight at once. For a small team with one or two engineers, a single dev warehouse with conventions about who is using it works fine.

flowchart TB
    PR[Pull request opened] --> SETUP[Spin up dev schema]
    SETUP --> BUILD[dbt build on sample data]
    BUILD --> TEST[dbt test plus pytest]
    TEST --> REPORT[Report status to PR]
    REPORT --> TEARDOWN[Teardown dev schema]
    TEARDOWN --> APPROVE[Approve and merge]

When test on prod is the right answer

There is a strain of advice in the data-engineering community that “you cannot really test a pipeline until it runs on real data”. That is too strong, but it points at something true. The real data has the bugs you most want to catch: skewed distributions, late-arriving rows, malformed values from upstream sources, the long tail of edge cases that synthetic data never reproduces.

The discipline that makes this safe is not “skip CI”. It is “deploy to a parallel layer, validate, then cut over”. Lesson 36 introduced the medallion architecture; the silver layer is a natural place to land a new version’s output without consumers seeing it. A new pipeline version writes to a side table for a week; the team compares the side-table output to the production output; once the diffs are small and explicable, the consumer pointer flips.
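The comparison step can be as simple as a keyed diff between the two outputs. The sketch below assumes both tables fit in memory as lists of dict rows and share a unique key column; diff_outputs and the column names are hypothetical, and at real volume the same comparison would more likely be a SQL join run in the warehouse.

```python
def diff_outputs(prod_rows, candidate_rows, key):
    """Compare production output with side-table output, row by row on `key`."""
    prod = {r[key]: r for r in prod_rows}
    cand = {r[key]: r for r in candidate_rows}
    return {
        # Keys production has that the new version did not produce.
        "missing": sorted(prod.keys() - cand.keys()),
        # Keys only the new version produced.
        "extra": sorted(cand.keys() - prod.keys()),
        # Keys present in both whose rows differ in any column.
        "changed": sorted(k for k in prod.keys() & cand.keys() if prod[k] != cand[k]),
    }


prod = [{"id": 1, "total": 10}, {"id": 2, "total": 20}, {"id": 3, "total": 30}]
side = [{"id": 1, "total": 10}, {"id": 2, "total": 21}, {"id": 4, "total": 40}]
report = diff_outputs(prod, side, key="id")
# report == {"missing": [3], "extra": [4], "changed": [2]}
```

Each of the three lists is something the team has to explain before the cutover: a deliberate fix, an expected behaviour change, or a bug in the new version.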

This is the dark-launch pattern from lesson 52, applied at the testing level rather than the deployment level. CI gives you confidence that the code is structurally correct. The dark launch gives you confidence that it is correct on the real distribution. They are layers, not alternatives.

What this means in practice

The realistic CI setup for a data team starts small and grows.

Day one is a pytest suite for the transformation functions and a dbt test step for the SQL models. That alone catches the bulk of bugs and runs in a few minutes.

Day thirty adds the integration layer: a small canonical input, run through every stage of the pipeline in CI, with the output validated. The fixtures live in the repo, so the pull request sees the same shape of data the production pipeline does.

Day ninety adds the fresh-warehouse pattern if multiple changes are in flight at once, or sticks with a single dev warehouse if not. The nightly end-to-end run is set up as a separate pipeline in the orchestrator, against anonymized realistic data.

What the team does not do, ever, is run the full production pipeline on the full production data on every pull request. That is the path the cloud bill never recovers from.

CI is the upstream of CD. Lesson 52 covered the deployment patterns once the change is merged. The two work together: CI catches the bugs, CD limits the blast radius of the ones that slip through.

Citations

dbt documentation, “Tests” (https://docs.getdbt.com/docs/build/tests, retrieved 2026-05-01).
Great Expectations documentation (https://docs.greatexpectations.io/, retrieved 2026-05-01).
pytest documentation (https://docs.pytest.org/, retrieved 2026-05-01).
“Continuous integration for data” on the dbt blog (https://www.getdbt.com/blog/, retrieved 2026-05-01).