Module 8 is about reliability. The first half of the course built systems. Module 7 covered the practices that ship them. This module is about the disciplines that keep them running once real users depend on them, and it has to start with the question every data team eventually gets asked: how reliable is the data, exactly?
The honest answer most teams give is some shape of “pretty reliable, usually”. That answer does not survive a single bad week. The dashboard was late on Tuesday. The numbers on Friday’s report did not match the source system. The customer signed a contract assuming overnight refresh and is now annoyed that “overnight” sometimes means 11am. Nobody wrote down what reliability meant, so everyone has a different mental model, and every incident becomes a renegotiation.
Google’s Site Reliability Engineering practice solved this for software services about fifteen years ago. The framework is now widely adopted: SLIs measure, SLOs target, SLAs commit, and error budgets connect reliability to the rest of the work. The framework was written for request-response systems (latency, availability, error rate of an HTTP service), but the same shape applies cleanly to data products if the indicators are chosen carefully. This lesson translates SRE’s framework into the data-engineering vocabulary and walks through the policies that make it operational.
The four terms
The vocabulary is small but each term has a precise role. Conflating them is the most common mistake in early adoption.
SLI, the Service Level Indicator, is a measurable property of the system. It is a number you can compute from data your monitoring already collects. “Fraction of dashboard refreshes that complete by 9am, computed daily” is an SLI. “Fraction of rows in the customer table that are non-null on email, computed hourly” is an SLI. The defining test is: can a script return a number? If yes, it is a candidate SLI. If the property is “data quality”, it is not yet an SLI; it is a category that needs sharpening into a measurable.
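The "can a script return a number?" test can be made concrete. The following is a minimal sketch of the dashboard-refresh SLI, assuming refresh completion timestamps are already available as a list; the function name `on_time_fraction` and the sample timestamps are hypothetical, and the sketch assumes each refresh completes on the day it is due.

```python
from datetime import datetime, time


def on_time_fraction(completions: list[datetime], deadline: time = time(9, 0)) -> float:
    """Fraction of refreshes whose completion time-of-day is at or before the deadline."""
    if not completions:
        raise ValueError("no refreshes recorded; the SLI is undefined")
    on_time = sum(1 for c in completions if c.time() <= deadline)
    return on_time / len(completions)


# Hypothetical completion times for four daily refreshes.
refreshes = [
    datetime(2024, 3, 4, 8, 41),
    datetime(2024, 3, 5, 8, 55),
    datetime(2024, 3, 6, 11, 2),  # the "overnight means 11am" case
    datetime(2024, 3, 7, 8, 30),
]
print(f"on-time refresh SLI: {on_time_fraction(refreshes):.2%}")  # 75.00%
```

The function returns a single number from recorded data, which is what qualifies it as a candidate SLI. "Data quality" has no equivalent function until someone decides what to count.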
SLO, the Service Level Objective, is the target you set for the SLI. The SLI is the measurement; the SLO is the goal. “99% of dashboard refreshes complete by 9am, measured monthly” is an SLO built on the previous SLI. The objective is internal. It is the bar the team holds itself to, and it is the bar that drives engineering priorities. The number is chosen deliberately, not aspirationally; more on that below.
SLA, the Service Level Agreement, is a contractual commitment to an external party with consequences if missed. SLAs usually involve money: service credits, refunds, contractual penalties. The crucial property is that the SLA is almost always weaker than the corresponding internal SLO. The team commits internally to 99.9% so that it can confidently sign an external SLA at 99%. The gap is the safety margin. A team whose SLA equals its SLO is one bad week away from paying penalties. A team whose SLA is laxer than its SLO has room to absorb normal variance without breaching contracts.
Error budget is the complement of the SLO: 100% minus the target. If the SLO is 99%, the error budget is 1%. Over a 30-day month, that 1% is roughly 7 hours of allowable downtime, or one allowable failure in every hundred refreshes, depending on how the SLI is counted. The budget is a finite resource, like a credit limit for unreliability. Every incident, every late refresh, every missing row spends some of it. When it is gone, behaviour changes (described below).
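The arithmetic behind the budget is short enough to sketch. The example below assumes a 30-day period and an hourly refresh cadence; both function names are hypothetical, and the choice between a time-based and a count-based budget depends on how the SLI itself is defined.

```python
import math
from datetime import timedelta


def time_budget(slo: float, period: timedelta) -> timedelta:
    """Allowable 'bad' time in the period for a time-based SLI."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return period * (1.0 - slo)


def count_budget(slo: float, expected_events: int) -> int:
    """Allowable failed events in the period for a count-based SLI."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    # Round before flooring so floating-point noise cannot shift the result.
    return math.floor(round(expected_events * (1.0 - slo), 6))


month = timedelta(days=30)
print(time_budget(0.99, month))     # 7:12:00
print(count_budget(0.99, 24 * 30))  # 7 failed hourly refreshes
```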
The relationships are easy to lose track of. The SLI is the gauge on the dashboard. The SLO is where the green-amber-red bands sit. The SLA is what is in the contract. The error budget is the headroom before the SLO is breached.
Applying the framework to data
Data products have four reliability dimensions worth instrumenting, and each maps cleanly to an SLI shape.
Freshness SLOs capture how stale the data is allowed to be. “The customer table is no more than 1 hour stale, 99% of the time, measured over rolling 30 days.” The SLI is computed from the maximum lag between source-system commit time and warehouse-visible time, sampled however the system allows. Freshness is what most business stakeholders mean when they say “the data is broken”; the report is yesterday’s data when they expected today’s, and the model that decides what “today’s” means is the freshness SLO.
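A freshness SLI of this shape might be computed as follows. This is a sketch under stated assumptions: each sample is a hypothetical pair of source commit time and warehouse-visible time, and how those pairs are collected is left to whatever the system allows.

```python
from datetime import datetime, timedelta


def freshness_sli(
    samples: list[tuple[datetime, datetime]],
    max_lag: timedelta = timedelta(hours=1),
) -> float:
    """Fraction of samples whose commit-to-visible lag is within max_lag."""
    if not samples:
        raise ValueError("no samples recorded; the SLI is undefined")
    fresh = sum(1 for committed, visible in samples if visible - committed <= max_lag)
    return fresh / len(samples)


# Hypothetical samples: the same commit time with eight observed lags, in minutes.
base = datetime(2024, 3, 4, 6, 0)
lags_in_minutes = (12, 35, 58, 61, 20, 140, 15, 44)
samples = [(base, base + timedelta(minutes=m)) for m in lags_in_minutes]
print(f"freshness SLI: {freshness_sli(samples):.2%}")  # 75.00%
```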
Completeness SLOs capture missing rows. “Less than 0.1% of expected rows are missing per day, measured per pipeline.” The SLI compares expected row counts (from source-system metadata, from a known cardinality, or from the previous day plus a tolerance) against actual row counts. Completeness failures often hide because the dashboard still renders; the chart just shows a smaller number than it should, and nobody notices until a salesperson asks why their region looks wrong.
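The comparison of expected against actual row counts can be sketched as below. The function names and row counts are illustrative, and the sketch assumes the expected count has already been obtained from one of the sources listed above.

```python
def missing_fraction(expected_rows: int, actual_rows: int) -> float:
    """Share of expected rows that did not arrive."""
    if expected_rows <= 0:
        raise ValueError("expected_rows must be positive")
    # Surplus rows (for example duplicates) are a different problem, so clamp at zero.
    missing = max(expected_rows - actual_rows, 0)
    return missing / expected_rows


def meets_completeness_slo(
    expected_rows: int, actual_rows: int, max_missing: float = 0.001
) -> bool:
    """True when less than max_missing (0.1% by default) of expected rows are missing."""
    return missing_fraction(expected_rows, actual_rows) < max_missing


print(meets_completeness_slo(1_000_000, 999_400))  # True: 0.06% missing
print(meets_completeness_slo(1_000_000, 998_500))  # False: 0.15% missing
```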
Accuracy SLOs capture wrongness, not absence. “The daily revenue total in the warehouse matches the source-of-truth ERP system within 0.01%, measured nightly.” The SLI is a reconciliation check: take the same metric from two systems, compare. Accuracy is the hardest dimension to instrument because you need an independent ground truth, but it is also the dimension that destroys trust fastest when it fails. Stakeholders forgive a late dashboard. They do not forgive a wrong dashboard.
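A reconciliation check of this kind reduces to a relative difference against a tolerance. The sketch below assumes both totals have already been fetched from their systems; the function names and figures are hypothetical, and `Decimal` is used so that currency amounts are not distorted by binary floating point.

```python
from decimal import Decimal


def relative_difference(warehouse_total: Decimal, source_total: Decimal) -> Decimal:
    """Absolute difference between the two totals, relative to the source of truth."""
    if source_total == 0:
        raise ValueError("source total is zero; relative difference is undefined")
    return abs(warehouse_total - source_total) / abs(source_total)


def reconciles(
    warehouse_total: Decimal,
    source_total: Decimal,
    tolerance: Decimal = Decimal("0.0001"),  # 0.01% expressed as a fraction
) -> bool:
    """True when the warehouse total is within tolerance of the source total."""
    return relative_difference(warehouse_total, source_total) <= tolerance


erp_revenue = Decimal("1250000.00")
print(reconciles(Decimal("1249950.00"), erp_revenue))  # True: off by 0.004%
print(reconciles(Decimal("1249000.00"), erp_revenue))  # False: off by 0.08%
```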
Availability SLOs capture whether the data is queryable when expected. “The BI dashboard is queryable for 99.9% of business hours, measured per region.” The SLI is a synthetic probe: every minute, an automated query runs, and the system records whether it succeeded within a latency threshold. Availability is the closest to traditional service SLOs, because the dashboard is essentially a request-response system once the data is loaded.
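Once probe results are recorded, the availability SLI is a count of good probes over total probes. The sketch below only aggregates results and does not run real queries; the `ProbeResult` type, the 5-second threshold, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    succeeded: bool
    latency_seconds: float


def availability_sli(probes: list[ProbeResult], latency_threshold_seconds: float = 5.0) -> float:
    """Fraction of probes that succeeded within the latency threshold."""
    if not probes:
        raise ValueError("no probes recorded; the SLI is undefined")
    good = sum(
        1 for p in probes
        if p.succeeded and p.latency_seconds <= latency_threshold_seconds
    )
    return good / len(probes)


# Hypothetical day: 600 one-minute probes across ten business hours.
probes = [ProbeResult(True, 0.8)] * 598
probes.append(ProbeResult(False, 30.0))  # query failed outright
probes.append(ProbeResult(True, 12.4))   # succeeded, but too slowly to count as good
print(f"availability SLI: {availability_sli(probes):.3%}")
```

Two bad probes out of 600 already put this day below a 99.9% target, which illustrates how little headroom three nines leave over a short window.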
Most data products want SLOs in two or three of these dimensions, not all four. Choose the dimensions that map to actual stakeholder pain.
Why SLOs work better than “we should just be reliable”
The pre-SLO model of reliability is implicit and adversarial. Engineering wants to ship features. Operations or stakeholders want fewer incidents. Every incident becomes a fight about whether the team is moving too fast, and there is no shared standard for “too fast”. The SLO framework replaces that fight with three properties.
It makes the trade-off explicit. 100% reliability is not achievable, and even if it were, the cost would be absurd: every nine of additional reliability typically costs an order of magnitude more engineering effort. The SLO names the target deliberately. 99% is cheap. 99.9% is meaningfully more expensive. 99.99% requires real engineering investment. 99.999% is rarely justified outside payment systems and life-safety. Choosing 99% versus 99.9% is choosing how much of the engineering budget goes to reliability versus features, and the SLO makes that choice visible instead of pretending it does not exist.
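One way to see why each additional nine is harder is to look at how little failure it leaves room for. The sketch below computes allowable downtime over a 30-day window for each target; it illustrates the budget side only and says nothing about engineering cost. The function name is hypothetical, and `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def allowed_minutes(slo_percent: str) -> Fraction:
    """Minutes of allowable downtime in 30 days for an SLO given as a percentage string."""
    budget_fraction = (100 - Fraction(slo_percent)) / 100
    return budget_fraction * MINUTES_PER_30_DAYS


for target in ("99", "99.9", "99.99", "99.999"):
    minutes = float(allowed_minutes(target))
    print(f"{target:>7}%  {minutes:8.2f} minutes per 30 days")
```

Each added nine divides the allowance by ten: 432 minutes at 99% shrinks to under half a minute at 99.999%.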
It gives the team permission to be imperfect. An incident that consumes part of the error budget is not a moral failure. It is the system spending the budget that was already allocated to it. Over a quarter, some incidents are expected. The team that has zero incidents is probably under-investing in feature velocity; the team that exhausts its budget every month is under-investing in reliability. The budget makes the conversation quantitative.
It prioritises work automatically. When the budget is healthy, ship features and take risks. When the budget is gone, the same risks are no longer acceptable, and engineering effort shifts to reliability work until the budget recovers. The shift is policy, not negotiation. There is a written rule, the team follows it, the conversation about what to work on becomes brief.
The error-budget policy
The error-budget policy is the document that turns the abstract budget into operational behaviour. The Google SRE Workbook (https://sre.google/workbook/error-budget-policy/, retrieved 2026-05-01) gives a template; most organisations adapt it. The structure is three states with different rules.
Budget healthy (more than 50% of the period’s budget remains): normal operations. Ship features. Take reasonable risks. Deploy on Fridays if the deployment system is reliable enough. Run experiments. The team is operating in the regime the SLO was designed to allow.
Budget low (50% or less remains, but more than 0%): caution. Risky changes get extra review. Major migrations get postponed if not already in flight. Deployment frequency may slow. The team starts paying down reliability debt: alert tuning, runbook updates, the small fixes that have been deferred. The goal is to recover budget before exhaustion forces harder measures.
Budget exhausted (zero or negative for the period): freeze. No deployments except those that directly improve reliability or fix the incidents that consumed the budget. Feature work pauses. The team focuses on getting the SLI back into the green and on the engineering work that prevents the next breach. The freeze is uncomfortable, which is the point: it creates pressure to invest in reliability before the budget runs out, not after.
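The three states can be expressed as a small classification function. This is a sketch of one possible encoding, not the Workbook's template: the names are hypothetical, the budget is counted in events, and treating exactly 50% remaining as "low" is an assumption.

```python
from enum import Enum


class BudgetState(Enum):
    HEALTHY = "healthy"      # normal operations
    LOW = "low"              # caution, pay down reliability debt
    EXHAUSTED = "exhausted"  # freeze


def budget_state(bad_events: int, allowed_bad_events: int) -> BudgetState:
    """Classify the period's error budget from events spent versus events allowed."""
    if allowed_bad_events <= 0:
        raise ValueError("allowed_bad_events must be positive")
    remaining = allowed_bad_events - bad_events
    if remaining <= 0:
        return BudgetState.EXHAUSTED
    # Integer comparison: more than half the budget remains.
    if 2 * remaining > allowed_bad_events:
        return BudgetState.HEALTHY
    return BudgetState.LOW  # includes exactly 50% remaining (an assumption)


for spent in (3, 5, 7, 10, 12):
    print(spent, budget_state(spent, allowed_bad_events=10).value)
```

Keeping the thresholds in code rather than in people's heads is one way to make the shift between states policy and not negotiation.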
The freeze is the controversial part. Engineering teams resist it because feature work is more visible. Product managers resist it because roadmap commitments slip. The framework only works if leadership backs the policy when it is invoked. A policy that is suspended the first time it actually triggers is a policy that has trained the team to ignore it.
Setting realistic SLOs
The most common SLO mistake is picking the number aspirationally. “We want 99.99% because the customer asked for it” is not an SLO. It is a wish. The actual SLO has to be something the system can deliver with reasonable engineering effort, and the way to find that number is to start from current performance.
Measure the SLI for a representative period (a month, a quarter). Look at the distribution. The SLO should be set just above current performance, not far above it. If the dashboard currently refreshes by 9am 97% of the time, set the SLO at 98%, not 99.9%. A 98% target is within reach of what the system already delivers, so the budget is tight but not hopeless, while the small step up forces incremental improvement. After a quarter of meeting 98%, raise to 98.5%. Slow ratchet, sustainable improvement.
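The ratchet can be written down as a rule so that raising the target is not a matter of mood. The sketch below is one hypothetical encoding: the step size, the ceiling, and the condition "raise only if the quarter met the current target" are all assumptions, and every value is a percentage.

```python
from decimal import Decimal


def ratchet(
    current_slo: Decimal,
    measured_sli: Decimal,
    step: Decimal = Decimal("0.5"),
    ceiling: Decimal = Decimal("99.9"),
) -> Decimal:
    """Return next quarter's SLO: one step higher if the target was met, unchanged otherwise."""
    if measured_sli < current_slo:
        return current_slo  # target missed: hold the line, do not raise it
    return min(current_slo + step, ceiling)


print(ratchet(Decimal("98"), Decimal("98.3")))  # 98.5
print(ratchet(Decimal("98"), Decimal("97.6")))  # 98
```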
The opposite mistake is setting the SLO too laxly. If current performance is 99.5% and the SLO is set at 95%, the budget is so generous that incidents do not trigger the policy, and the framework provides no operational pressure. The SLO has to be set at a number that the system actually risks breaching, otherwise it is theatre.
External commitments (the SLA) are then derived from the internal SLO, not the other way around. The internal team commits to 99% as the SLO. The external contract commits to 95% as the SLA. The 4-percentage-point gap is the safety margin. Customers see a 95% commitment and the team operates against a 99% target, which means the SLA is almost never breached even when the SLO is.
The error-budget loop
```mermaid
flowchart TD
    A[Measure SLI from monitoring] --> B[Compare to SLO target]
    B --> C[Compute remaining error budget]
    C --> D{Budget state?}
    D -->|Healthy| E[Ship features, take risks]
    D -->|Low| F[Caution, pay down reliability debt]
    D -->|Exhausted| G[Freeze, focus on reliability]
    E --> H[Next measurement period]
    F --> H
    G --> H
    H --> A
```
The loop is the daily and weekly rhythm. Monitoring computes SLIs continuously. A dashboard displays current SLO compliance and remaining budget. Weekly or monthly review meetings inspect the budget state and adjust the work allocation. Incidents update the budget in real time. The framework becomes the lens through which the team sees its own reliability work.
The maturity arc is gradual. In the first quarter, the team picks SLOs that turn out to be wrong (too tight, too loose, measuring the wrong thing) and revises them. In the second quarter, the SLIs are reasonable but the policy has not yet been invoked, so it has not been tested. In the third or fourth quarter, the budget exhausts for the first time, the policy is invoked, and the organisation discovers whether it actually believes in the framework. Teams that survive that test have a real reliability practice. Teams that fail it have only a document.
Lesson 61 picks up the thread by going one layer deeper into the most common SLI category for data products: data quality. The SLO says “less than 0.1% of expected rows are missing”; the data-quality testing practice is what produces that number reliably. SLOs without quality testing are guesses. Quality testing without SLOs is busywork.