Data quality: Great Expectations, Soda, dbt tests

Lesson 60 ended on a connection: SLOs without quality testing are guesses. The “less than 0.1% of expected rows missing” SLI has to be computed somewhere, and that somewhere is a data-quality framework that runs continuously, evaluates assertions against the data, and reports pass or fail in a way the SLO machinery can consume. This lesson is about the frameworks.

The market has converged on three open-source tools: dbt tests, Great Expectations, and Soda. They overlap in capability but have distinct centers of gravity, and the right answer in most data platforms is to use more than one. Picking among them as competitors usually produces a worse outcome than understanding what each is best at.

The framing for the whole lesson: data quality testing answers one question. Is the data I am about to use trustworthy? The “about to use” matters. A test that runs once a week catches problems a week late. The frameworks below are designed to run on every pipeline execution, which is what makes them actual quality controls rather than periodic audits.

The four classic dimensions

Data quality has accumulated taxonomies over the years. The four-dimension version is the one most teams settle on because it covers the practical failure modes without proliferating into categories nobody can keep straight.

Schema asks whether the columns exist with the right types. A pipeline expects customer_id as a non-null integer; the source provides it as a string with leading zeros that may or may not be present. Schema tests catch type drift, renamed columns, and dropped columns. Schema is the cheapest dimension to test and often the highest-value: many production data incidents start with an undetected schema change.

Completeness asks whether the required values are present. A column declared NOT NULL in the contract but actually 5% null in the data is a completeness failure. Completeness tests can be column-level (is this column non-null where it should be?) or row-level (are all expected rows present, judged by comparing counts against an expected cardinality or against the previous run plus a tolerance?).

Validity asks whether values fall within expected ranges or sets. Order amounts should be positive. Country codes should belong to the ISO list. Email addresses should match a pattern. Validity tests catch data that is technically present but semantically wrong.

Consistency asks whether related tables agree. Foreign keys exist in the parent table. The sum of line-item totals matches the order header. The daily revenue total matches across the warehouse and the source ERP. Consistency catches failures the other three miss, because consistency violations can occur even when every individual table looks healthy.

The pattern in practice is to think of the dimensions as a triage hierarchy. Schema breaks before completeness, completeness before validity, validity before consistency. Test in that order; if schema is wrong, the other tests will be noisily wrong for the wrong reason.
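The triage order can be sketched in a few lines of plain Python. This is a minimal illustration, not any framework's API: the column names, the EXPECTED_TYPES mapping, and the first_failing_dimension helper are all hypothetical, and rows are modelled as plain dicts.

```python
from typing import Callable, Optional

Row = dict  # one record: column name -> value

# Hypothetical contract for an orders table.
EXPECTED_TYPES = {"customer_id": int, "email": str, "amount": float}


def check_schema(orders: list[Row]) -> bool:
    # Schema: every expected column exists and non-null values have the right type.
    return all(
        col in row and (row[col] is None or isinstance(row[col], typ))
        for row in orders
        for col, typ in EXPECTED_TYPES.items()
    )


def check_completeness(orders: list[Row]) -> bool:
    # Completeness: required values are present (no None).
    return all(row[col] is not None for row in orders for col in EXPECTED_TYPES)


def check_validity(orders: list[Row]) -> bool:
    # Validity: order amounts are positive.
    return all(row["amount"] > 0 for row in orders)


def check_consistency(orders: list[Row], customer_ids: set) -> bool:
    # Consistency: every order references a customer present in the parent table.
    return all(row["customer_id"] in customer_ids for row in orders)


def first_failing_dimension(orders: list[Row], customer_ids: set) -> Optional[str]:
    checks: list[tuple[str, Callable[[], bool]]] = [
        ("schema", lambda: check_schema(orders)),
        ("completeness", lambda: check_completeness(orders)),
        ("validity", lambda: check_validity(orders)),
        ("consistency", lambda: check_consistency(orders, customer_ids)),
    ]
    for name, check in checks:
        if not check():
            return name  # stop: later dimensions would fail for the wrong reason
    return None


customers = {1, 2}
good = [{"customer_id": 1, "email": "a@example.com", "amount": 9.5}]
# customer_id arrives as a string AND the amount is negative.
drifted = [{"customer_id": "0042", "email": "a@example.com", "amount": -1.0}]
```

For the drifted batch only the schema failure is reported, even though the amount is also invalid; the later checks never run against data whose shape is already wrong.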

dbt tests: in-warehouse, declarative, free

dbt’s test framework (https://docs.getdbt.com/docs/build/data-tests, retrieved 2026-05-01) is the option most warehouse-resident data teams already have. Tests are declared in YAML alongside the model definitions and execute as SQL against the warehouse. The bundled tests cover the high-frequency cases.

not_null asserts a column has no nulls. unique asserts no duplicates. accepted_values asserts values fall in an enumerated set. relationships asserts foreign-key integrity by checking that every value in column A exists in column B of another table. Beyond those, dbt_utils.expression_is_true runs an arbitrary SQL boolean expression as a test, which covers the long tail.
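A dbt data test passes when its query returns zero failing rows. The sketch below imitates that contract with Python's sqlite3 module so it can run anywhere; the SQL strings are hand-written approximations of the four bundled tests, not the SQL dbt actually generates, and the tables and values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES
        (10, 1, 'active'),
        (11, 2, 'inactive'),
        (11, 2, 'active'),
        (12, 3, 'pending'),
        (13, NULL, 'active');
""")

# Each query selects the rows that VIOLATE the assertion.
TESTS = {
    "not_null": "SELECT order_id FROM orders WHERE customer_id IS NULL",
    "unique": """
        SELECT order_id FROM orders
        GROUP BY order_id HAVING COUNT(*) > 1""",
    "accepted_values": """
        SELECT order_id FROM orders
        WHERE status NOT IN ('active', 'inactive', 'unknown')""",
    "relationships": """
        SELECT o.order_id FROM orders o
        LEFT JOIN customers c ON o.customer_id = c.customer_id
        WHERE o.customer_id IS NOT NULL AND c.customer_id IS NULL""",
}

# Number of failing rows per test; zero means the test passes.
failures = {name: len(conn.execute(sql).fetchall()) for name, sql in TESTS.items()}
passed = {name: count == 0 for name, count in failures.items()}
```

Every test fails here by design: order 13 has no customer, order_id 11 appears twice, 'pending' is not an accepted status, and customer 3 does not exist in the parent table.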

The strengths of dbt tests are pragmatic. The tests live in version control next to the models they protect, so they evolve together. They run as part of the dbt invocation, baked into the same CI pipeline that lesson 51 covered. They produce SQL that runs on the warehouse, so there is no data movement and no separate compute environment.

The weaknesses are the corollary. dbt tests run only inside the warehouse. They cannot test files in object storage before they are loaded, cannot test data leaving the warehouse for downstream consumers, and cannot easily test continuously between scheduled runs. Test types beyond the bundled four require custom SQL or packages like dbt_expectations.

For a warehouse-centric platform where dbt is the transformation layer, dbt tests cover most of the practical quality testing, roughly 70-80% as a rule of thumb; the remaining 20-30% justifies one of the other tools.

Great Expectations: portable, file-aware, profiling

Great Expectations (https://docs.greatexpectations.io/, retrieved 2026-05-01), abbreviated GX, is a Python library in production at scale since around 2019. Its philosophy is that data tests should be portable assertions about data shape, expressible in a JSON-like format that travels with the data rather than being tied to a specific database engine.

The unit of testing is the “expectation”; expect_column_values_to_not_be_null, expect_column_values_to_be_in_set, and expect_column_mean_to_be_between are typical examples. The library is large, with dozens of expectations covering numerical, categorical, and statistical properties, plus the ability to define custom expectations in Python.

The strength is reach. GX runs against pandas DataFrames, Spark DataFrames, SQL warehouses (via SQLAlchemy), and files in object storage. The same assertion can be applied at multiple stages: on a Parquet file landing in S3, on a Spark DataFrame mid-transformation, on a warehouse table after dbt has materialised it.
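The idea of a portable assertion can be illustrated without the library itself. The sketch below is not the GX API: it is a small hand-rolled interpreter over a JSON-serialisable suite, loosely modelled on the expectation names and keyword arguments (column, value_set, min_value, max_value) that GX uses. The suite layout, the evaluator functions, and the sample rows are assumptions for illustration, with simplified semantics.

```python
import json
from statistics import mean

# A suite is plain data, so it can be stored and shipped as JSON.
suite_json = json.dumps([
    {"expectation": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "customer_id"}},
    {"expectation": "expect_column_values_to_be_in_set",
     "kwargs": {"column": "country", "value_set": ["DE", "FR", "US"]}},
    {"expectation": "expect_column_mean_to_be_between",
     "kwargs": {"column": "amount", "min_value": 10, "max_value": 100}},
])


def _not_null(rows, column):
    return all(r.get(column) is not None for r in rows)


def _in_set(rows, column, value_set):
    # Nulls are skipped here; null handling belongs to the not-null check.
    return all(r[column] in value_set for r in rows if r.get(column) is not None)


def _mean_between(rows, column, min_value, max_value):
    values = [r[column] for r in rows if r.get(column) is not None]
    return bool(values) and min_value <= mean(values) <= max_value


EVALUATORS = {
    "expect_column_values_to_not_be_null": _not_null,
    "expect_column_values_to_be_in_set": _in_set,
    "expect_column_mean_to_be_between": _mean_between,
}


def validate(rows, suite):
    # Returns expectation name -> True (passed) / False (failed).
    return {e["expectation"]: EVALUATORS[e["expectation"]](rows, **e["kwargs"])
            for e in suite}


suite = json.loads(suite_json)

# The same suite applied at two stages of a hypothetical pipeline.
landed_file_rows = [
    {"customer_id": 1, "country": "DE", "amount": 20.0},
    {"customer_id": 2, "country": "FR", "amount": 60.0},
]
warehouse_rows = landed_file_rows + [
    {"customer_id": 3, "country": "XX", "amount": 40.0},
]

landed_result = validate(landed_file_rows, suite)
warehouse_result = validate(warehouse_rows, suite)
```

The point of the design is that the suite is data, not code bound to one engine: the same three assertions pass on the landed rows and flag the unexpected country code on the warehouse rows.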

GX also profiles data. Pointing it at a new dataset produces an automatically generated set of candidate expectations based on the observed distribution. The profiles are not production-ready as-is (too tight, every random fluctuation will fail them) but they are an excellent starting point.
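Why a generated profile is too tight can be shown with a toy example. This is not the GX profiler; profile_range, widen, and the 20% padding are hypothetical choices made for illustration, standing in for the human review step that turns a candidate expectation into a usable one.

```python
def profile_range(observed: list) -> tuple:
    # A naive "profile": the exact min and max seen in the sample.
    return min(observed), max(observed)


def widen(bounds: tuple, padding_pct: int) -> tuple:
    # Pad each bound by a percentage of its own value (integer arithmetic).
    low, high = bounds
    return low - low * padding_pct // 100, high + high * padding_pct // 100


def within(value: int, bounds: tuple) -> bool:
    low, high = bounds
    return low <= value <= high


quiet_week_row_counts = [1000, 1004, 1010, 1002]

generated = profile_range(quiet_week_row_counts)  # exactly what was observed
reviewed = widen(generated, padding_pct=20)       # loosened after human review

next_day_count = 1012  # an ordinary fluctuation
```

The generated bounds reject a row count of 1012 even though nothing is wrong; the reviewed bounds accept it. How much padding is appropriate depends on the dataset and is a judgment call.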

The weakness is operational complexity. GX has a richer concept model than dbt or Soda: data sources, batches, expectation suites, validation results, data docs. For teams that have warehouses and dbt and just want failed tests to fail their pipeline, GX often feels like overkill. Where it shines is at boundaries: ingestion, egress, and validation of files before any compute touches them. dbt is better inside the warehouse; GX is better at the edges.

Soda Core and Soda Cloud: YAML rules, focused product

Soda (https://docs.soda.io/, retrieved 2026-05-01) splits into two parts: Soda Core, the open-source library that runs checks, and Soda Cloud, the commercial dashboard and alerting layer. The differentiator is the rule language: SodaCL, a YAML-based language designed to read like English assertions about data.

A SodaCL check looks like missing_count(email) = 0 or duplicate_count(customer_id) = 0 or freshness(updated_at) < 1d. The vocabulary is tighter than GX (fewer kinds of checks) and the syntax is designed to be readable by analysts and data product managers, not just engineers. For organisations where data quality is owned by a mixed-skill team, the YAML-and-English approach reduces the friction of keeping checks current.
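To make the three checks concrete, here is a rough Python equivalent of what each metric measures. This is an illustration of the semantics, not Soda's implementation: the function bodies are assumptions, and in particular duplicate_count here counts distinct values that occur more than once, which may differ from Soda's exact definition (both agree on whether the result is zero).

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def missing_count(rows, column):
    # Number of rows where the column is NULL (None).
    return sum(1 for r in rows if r.get(column) is None)


def duplicate_count(rows, column):
    # Number of distinct non-null values that appear more than once.
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return sum(1 for n in counts.values() if n > 1)


def freshness(rows, column, now):
    # Age of the most recent timestamp in the column.
    return now - max(r[column] for r in rows)


now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"customer_id": 1, "email": "a@example.com",
     "updated_at": datetime(2026, 5, 1, 6, 0, tzinfo=timezone.utc)},
    {"customer_id": 2, "email": None,
     "updated_at": datetime(2026, 4, 30, 23, 0, tzinfo=timezone.utc)},
    {"customer_id": 2, "email": "b@example.com",
     "updated_at": datetime(2026, 4, 30, 22, 0, tzinfo=timezone.utc)},
]

results = {
    "missing_count(email) = 0": missing_count(rows, "email") == 0,
    "duplicate_count(customer_id) = 0": duplicate_count(rows, "customer_id") == 0,
    "freshness(updated_at) < 1d": freshness(rows, "updated_at", now) < timedelta(days=1),
}
```

On this sample the first two checks fail (one missing email, one duplicated customer_id) and the freshness check passes, because the newest row is six hours old.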

Soda runs against warehouses via connectors and against streaming data via integration with Spark and Kafka. It positions itself between dbt and GX: more flexible than dbt (because it is not tied to dbt’s lifecycle), more focused than GX (because it does fewer things deliberately). The Cloud product adds dashboards, alerting, lineage views, and incident workflows for teams that want a managed UI rather than building one on top of the open-source library.

In practice, the Soda choice often comes down to whether the organisation values the readable rule language and the Cloud dashboards enough to standardise on it, or prefers the GX richness or the dbt simplicity.

Patterns that work

Three operational patterns hold across the tools.

Test at boundaries, not at every step. Data pipelines have many stages: ingestion, staging, transformation, marts, exports. Putting tests at every stage produces a sea of assertions that nobody reads and that fail in correlated ways when something upstream breaks. The pattern that works is to test on ingestion (the raw data is shaped right, the schema matches the contract, the row count is plausible) and to test on output (the published data is correct, the marts have the right cardinalities, the metrics reconcile against the source). Intermediate steps get a thin layer of structural tests (schemas) but not the full quality suite. If ingestion is right and output is right, the middle is covered by the model tests in dbt or its equivalent.

Tier the tests by severity. Not every test failure is a pipeline-blocking event. A useful tiering: critical tests block the pipeline (the run halts and the bad data does not propagate), warning tests alert but allow continuation (the data flows but a notification is sent), informational tests log only (visible in dashboards but no notification). Tiering keeps the alerting volume manageable and prevents the “everything is critical” trap that destroys on-call sanity. A column-level non-null on a primary key is critical. A row-count drift of more than 20% is a warning. A row-count drift of more than 5% is informational.
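The tiering above maps naturally onto a small classifier. The sketch below uses the thresholds from the paragraph (primary-key nulls are critical, drift above 20% is a warning, drift above 5% is informational); the Severity enum and the function names are hypothetical, and each tool has its own way of expressing severities.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    CRITICAL = "critical"  # halt the pipeline
    WARNING = "warning"    # notify, let the data flow
    INFO = "info"          # log only


def classify_primary_key_nulls(null_count: int) -> Optional[Severity]:
    # Any null in a primary key blocks the run.
    return Severity.CRITICAL if null_count > 0 else None


def classify_row_count_drift(previous: int, current: int) -> Optional[Severity]:
    # Relative change against the previous run's row count.
    drift = abs(current - previous) / previous
    if drift > 0.20:
        return Severity.WARNING
    if drift > 0.05:
        return Severity.INFO
    return None


def pipeline_should_halt(findings: list) -> bool:
    # Only critical findings stop propagation.
    return any(f is Severity.CRITICAL for f in findings)


findings = [
    classify_primary_key_nulls(0),
    classify_row_count_drift(previous=1000, current=1300),
]
```

With these inputs the run continues: the 30% drift raises a warning, but nothing critical was found.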

Run continuously, not just on pipeline runs. The pipeline runs once per hour or once per day; data quality issues can arise between runs. Running a subset of the most important checks on a continuous schedule (every five minutes, every fifteen) catches drifts faster and feeds the SLO measurement loop from lesson 60. The tools all support this, though the details vary: dbt can be scheduled separately for tests, Soda has a cloud scheduler, GX runs wherever the orchestrator runs it.

The trap of over-testing

The failure mode that destroys data quality programmes is over-testing. The pattern is recognisable: a quality framework gets adopted, somebody runs the GX profiler or its equivalent, hundreds of generated tests get checked in, and within three months the team has hundreds of tests that nobody looks at. Some fail every run because the assertions are too strict. Some pass while the data is genuinely wrong because the assertion was the wrong shape. The team starts ignoring the test results as a category, which means the framework provides no signal anymore, which means the quality programme has effectively failed even though the tests still run.

Three specific anti-patterns produce this outcome.

Tests with no owner. The test was generated by a tool or written by an engineer who has since moved teams. When it fails, nobody knows whether the failure is real or whether the assertion was wrong from the start. The default reaction is to disable the test, and slowly the test suite hollows out.

Tests with the wrong tolerance. The expected row count is “between 1000 and 1010” because that was the range observed during a quiet week. The first busy week, the count goes to 1500, and the test fails not because anything is wrong but because the tolerance was never updated. Every test with a numeric range needs to be revisited as the data scales.

Tests that pass while the data is wrong. Some assertions do their job: the accepted-values test allows ['active', 'inactive', 'unknown'], the data starts emitting 'pending' because of a new state in the source system, and the test catches it. But a test asserting “no nulls in the email column” passes happily while the email column is mostly empty strings, because the assertion did not target the actual problem. Choosing the right assertion shape is more important than having many assertions.
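The empty-string case is easy to reproduce. In this minimal sketch the sample values and both helper names are invented; the point is only that the two assertions have different shapes.

```python
emails = ["a@example.com", "", "", "   ", "b@example.com"]


def passes_not_null(values):
    # The assertion as written: no value is NULL (None).
    return all(v is not None for v in values)


def passes_not_blank(values):
    # The assertion that matches the real requirement: a usable value exists.
    return all(v is not None and v.strip() != "" for v in values)


not_null_ok = passes_not_null(emails)    # True: the test is green
not_blank_ok = passes_not_blank(emails)  # False: three of five emails are unusable
```

Both assertions are cheap to run; only the second one tests what the consumer actually cares about.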

The discipline that prevents the trap is tying tests to SLOs. Lesson 60 said the SLO is “less than 0.1% of expected rows missing per day”. The corresponding tests are exactly the ones that, if they fail, indicate the SLO is at risk. Every test should be able to answer two questions: which SLO does it protect, and what would happen if it were removed? Tests that cannot answer either are candidates for deletion.
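The missing-rows SLO can be turned into a computable check in a few lines. This is a sketch under stated assumptions: the expected row count is taken as given (in practice it might come from the source system or a contract), and the function and constant names are hypothetical.

```python
SLO_MAX_MISSING_RATIO = 0.001  # "less than 0.1% of expected rows missing"


def missing_row_ratio(expected_rows: int, actual_rows: int) -> float:
    # SLI: share of expected rows that did not arrive. Surplus rows do not count as missing.
    missing = max(expected_rows - actual_rows, 0)
    return missing / expected_rows


def slo_met(expected_rows: int, actual_rows: int) -> bool:
    return missing_row_ratio(expected_rows, actual_rows) < SLO_MAX_MISSING_RATIO


healthy_day = slo_met(expected_rows=1_000_000, actual_rows=999_500)  # 0.05% missing
bad_day = slo_met(expected_rows=1_000_000, actual_rows=998_000)      # 0.2% missing
```

A row-count test whose threshold is derived from this ratio has an obvious answer to the question of which SLO it protects.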

Quality checks at boundaries

flowchart LR
    A[Source System] --> B[Ingestion check]
    B --> C[Raw landing zone]
    C --> D[Schema check]
    D --> E[Transform pipeline]
    E --> F[Mid-pipeline structural check]
    F --> G[Warehouse marts]
    G --> H[Output reconciliation check]
    H --> I[Published data]
    I --> J[Downstream consumers]

The diagram shows the recommended layering. The ingestion check (Great Expectations on the file, or schema-on-read in the loader) verifies the raw data is shaped right before it enters the platform. The schema check on the raw landing zone catches type drift before any compute touches the data. The mid-pipeline structural check (dbt’s not_null and unique on staging models) catches the structural failures cheaply. The output reconciliation check (Soda or custom SQL comparing warehouse totals against source-system totals) catches consistency failures at the boundary where they would otherwise leak to consumers. Each layer has a focused purpose; together they cover the four dimensions without duplication.
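The output reconciliation check is the least standardised of the four, so a sketch may help. The version below compares daily revenue totals from two systems in plain Python; the reconcile function, the one-cent tolerance, and the sample figures are assumptions for illustration, and in practice the totals would come from queries against the warehouse and the source system.

```python
from decimal import Decimal


def reconcile(source_totals: dict, warehouse_totals: dict,
              tolerance: Decimal = Decimal("0.01")) -> dict:
    # Returns day -> (warehouse - source) for every day outside the tolerance.
    # A day missing from one side is treated as a total of zero on that side.
    variances = {}
    for day in sorted(set(source_totals) | set(warehouse_totals)):
        src = source_totals.get(day, Decimal("0"))
        wh = warehouse_totals.get(day, Decimal("0"))
        diff = wh - src
        if abs(diff) > tolerance:
            variances[day] = diff
    return variances


source_erp = {
    "2026-04-29": Decimal("184200.00"),
    "2026-04-30": Decimal("191750.50"),
}
warehouse = {
    "2026-04-29": Decimal("184200.00"),
    "2026-04-30": Decimal("179250.50"),  # rows were lost somewhere in the pipeline
}

variances = reconcile(source_erp, warehouse)
```

Decimal is used rather than float so that currency totals compare exactly. Here the check flags 2026-04-30 with a variance of -12500.00 while every individual table could still look healthy on its own.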

The lesson connects upward: every check in the diagram produces an SLI, every SLI feeds an SLO, every SLO has a tier, and tier-1 SLOs are the ones the on-call gets paged for. Lesson 62 picks up that thread: when a check fails, when a freshness SLO burns budget, when an accuracy reconciliation flags a five-figure variance, what does the team actually do? Detection is half the work. The other half is the incident response discipline that turns an alert into a recovery.