On-call for data engineering

Module 8 has built up the operational side of running a data platform: orchestration (lesson 57), asset-oriented thinking (lesson 58), observability (lesson 59), SLOs (lesson 60), data quality (lesson 61), and incident response (lesson 62). Each of those lessons assumes a human is reachable when things go wrong. This lesson is about the rotation that produces the human, the rules that keep it sustainable, and the engineering work that determines whether being on-call is a manageable part of the job or the thing that makes people quit.

Good on-call is invisible: the pager rarely fires, the rotation is shared fairly, hand-offs are crisp, the people on-call sleep. Bad on-call is corrosive: the pager fires constantly, the same two people carry it all, alerts get ignored on principle, and eventually a real incident slips through because the team has been trained not to look.

What on-call is

On-call is a rotation in which one person (or a small group) is responsible for responding to alerts during a defined window. The window can be 24/7 for systems that need round-the-clock coverage, or business hours for systems where overnight breakage is acceptable. For data engineering, the alerts are about pipeline failures, data quality regressions, SLO burn, infrastructure incidents, and the long tail of “the platform did something unusual and a human needs to look”.

The argument for on-call existing at all is straightforward. Data systems are too important to be unattended. A 3 a.m. pipeline failure that nobody fixes until morning means the morning dashboard is wrong and the morning meeting is uninformed. A data quality regression that goes undetected for a day pollutes downstream tables, models, and reports, and the cleanup cost grows the longer it sits. The goal of on-call is not to prevent failures; failures will happen. The goal is to bound the time between failure and human attention, so the blast radius stays small.

The rotation shape varies. A four-person team might rotate weekly. A larger team might split primary and secondary roles. A geographically distributed team might follow the sun, so nobody gets paged at 3 a.m. local time. The right rotation is whichever one keeps the load distributed and the response times sane.
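As an illustration of the simplest shape, a weekly rotation with a primary and a secondary can be generated mechanically. This is a minimal sketch, not a scheduling tool: the function name, the engineer names, and the choice to make each week's secondary the following week's primary are all assumptions made for the example.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Return a list of (week_start, primary, secondary) tuples.

    Assumption for this sketch: the secondary is always next week's primary,
    so they pick up context before they carry the pager themselves.
    """
    if len(engineers) < 2:
        raise ValueError("a primary/secondary rotation needs at least two people")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


# Hypothetical four-person team, rotating weekly from a Monday.
for week_start, primary, secondary in weekly_rotation(
    ["Ana", "Ben", "Chen", "Dev"], date(2025, 1, 6), 4
):
    print(week_start, primary, secondary)
```

With four people and a weekly cadence, each person is primary one week in four, which makes the load easy to reason about when someone asks to swap.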

The most important on-call discipline can be stated in one sentence: every alert that fires must be actionable.

Actionable means there is something the on-call can do right now to fix or mitigate the problem. Restart a job, scale up a cluster, fail over to a replica, page a downstream owner, mark a table as stale, run a backfill. Something concrete. If an alert fires and the on-call’s only response is “I noticed”, the alert is not actionable, and it should not be paging a human at 3 a.m.

This sounds obvious and is violated everywhere. Two common antipatterns:

Informational alerts. Someone wired up a metric, set a threshold, and pointed it at the pager because that was the easiest place to send notifications. The metric is interesting; it is not actionable. Examples: “queue depth went above 10,000”, “this job took 20 minutes longer than usual”, “disk usage passed 60 percent”. None require immediate action. They belong in a dashboard, a daily digest, or a Slack channel nobody is obliged to read at night. The pager is a scarce resource.

Noisy alerts. An alert fires often, is usually a false positive, and is routinely acknowledged-and-ignored. The team has trained itself to treat it as background noise. Eventually the alert fires for a real reason, and the on-call ignores it the same way. The alert that nobody trusts is worse than no alert at all, because the absence of an alert is at least an honest signal.

The fix is to treat every false-positive alert as a bug, with three possible resolutions: tune the threshold, fix the underlying flakiness, or delete the alert. There is no fourth option. Letting the alert keep firing and keep being ignored is a slow corruption of the team’s response discipline.

A practical heuristic: if an alert has fired more than twice without producing an action, it needs work. Make alert maintenance an explicit part of the on-call’s job, not an after-hours volunteer activity.
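The heuristic is easy to automate if the paging tool can export its history. The sketch below assumes a hypothetical export in which each firing records the alert name and whether the on-call took any action beyond acknowledging; the record shape and the alert names are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertFiring:
    """One row of a hypothetical alert-history export."""
    alert_name: str
    action_taken: bool  # True if the on-call did something beyond acknowledging


def alerts_needing_work(history, max_unactioned=2):
    """Return alert names that fired more than `max_unactioned` times with no action."""
    unactioned = Counter(f.alert_name for f in history if not f.action_taken)
    return sorted(name for name, count in unactioned.items() if count > max_unactioned)


history = [
    AlertFiring("queue_depth_high", False),
    AlertFiring("queue_depth_high", False),
    AlertFiring("queue_depth_high", False),
    AlertFiring("orders_pipeline_failed", True),
    AlertFiring("disk_usage_60pct", False),
]
print(alerts_needing_work(history))  # ['queue_depth_high']
```

The output is a work list for the on-call: each name on it gets tuned, fixed at the source, or deleted.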

Escalation

The on-call cannot fix everything alone. Escalation is the documented path from “primary on-call needs help” to “the right person is engaged”. A typical structure:

Primary on-call. First responder. Acknowledges every alert, triages, fixes what they can, escalates what they cannot.

Secondary on-call. Backup. Engaged when the primary is unreachable, overwhelmed, or needs a second pair of eyes.

Manager. Engaged when the incident has business impact that needs a decision the on-call cannot make alone: customer communication, vendor escalation, authorisation to take production-affecting actions.

Subject-matter expert. A specific person who knows a specific system: the one who built the warehouse ingestion, the one who maintains the CDC pipelines. Not on the rotation, but reachable for incidents that hit their area.

The escalation path needs to be documented and tested. Documented means a runbook lists who to call and how to reach them, with the expected response time at each tier. Tested means the team runs fire drills; quarterly is a reasonable cadence. The drill surfaces the stale phone numbers, the misconfigured on-call schedules, and the SMEs who have been on parental leave for three months without anyone updating the rota. Better to find this out in a drill than at 4 a.m. during a real incident.
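One way to keep the documented path testable is to write it down as data. The sketch below is an assumption-laden illustration: the tier timeouts, the `example.com` contacts, and the `tier_to_page` helper are hypothetical, and the subject-matter expert is left out because they are engaged by a human decision rather than by a timer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    role: str
    contact: str              # placeholder address, not a real one
    ack_timeout_minutes: int  # how long to wait for an acknowledgement before moving on


# Hypothetical path; real timeouts come from the team's runbook.
ESCALATION_PATH = [
    Tier("primary", "primary-oncall@example.com", 15),
    Tier("secondary", "secondary-oncall@example.com", 15),
    Tier("manager", "data-eng-manager@example.com", 30),
]


def tier_to_page(minutes_unacknowledged, path=ESCALATION_PATH):
    """Return the tier that should be paged once an alert has sat unacknowledged."""
    elapsed = 0
    for tier in path:
        elapsed += tier.ack_timeout_minutes
        if minutes_unacknowledged < elapsed:
            return tier
    return path[-1]  # past every timeout: keep paging the last tier


print(tier_to_page(0).role)   # primary
print(tier_to_page(20).role)  # secondary
print(tier_to_page(45).role)  # manager
```

A fire drill can then walk the same structure and confirm that each contact actually answers within its stated timeout.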

Hand-off

A rotation that does not hand off cleanly hides incidents. The outgoing on-call has accumulated context over the shift: the flapping alert, the open ticket nobody has closed, the half-finished investigation, the runbook change that needs review. If that context does not transfer, the incoming person starts cold.

The hand-off is a written note, a short meeting, or both. The written note is more reliable because it survives the meeting and is searchable later. A reasonable note covers:

Recent incidents during the shift, with current status.
Open issues that might page tonight, with the reason they are open and the mitigation.
Known flaky alerts, with the workaround the outgoing on-call has been using.
Anything scheduled for the next shift: planned deploys, ongoing migrations, maintenance windows.

The note lives in a shared place: a wiki page, a shared document, a channel the team reads. Not in private notes or a DM.
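The four items a note should cover can be turned into a small template so that no section is silently skipped. This is a sketch only: the `HandoffNote` class, its field names, and the sample entries are assumptions made for the example, and the rendered text is meant to be pasted into whatever shared place the team already uses.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffNote:
    outgoing: str
    incoming: str
    recent_incidents: list = field(default_factory=list)
    open_issues: list = field(default_factory=list)
    flaky_alerts: list = field(default_factory=list)
    scheduled: list = field(default_factory=list)

    def render(self):
        """Render the note as plain text, one section per hand-off item."""
        sections = [
            ("Recent incidents", self.recent_incidents),
            ("Open issues that might page", self.open_issues),
            ("Known flaky alerts", self.flaky_alerts),
            ("Scheduled for next shift", self.scheduled),
        ]
        lines = [f"On-call hand-off: {self.outgoing} -> {self.incoming}"]
        for title, items in sections:
            lines.append("")
            lines.append(f"{title}:")
            # An explicit "none" shows the section was considered, not forgotten.
            lines.extend(f"  - {item}" for item in (items or ["none"]))
        return "\n".join(lines)


note = HandoffNote(
    outgoing="Ana",
    incoming="Ben",
    open_issues=["orders backfill still running; safe to restart if it stalls"],
    scheduled=["warehouse maintenance window on Thursday"],
)
print(note.render())
```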

Compensation

On-call is real work, and the rotation interrupts personal time and sleep. Pretending otherwise is how teams produce burnout.

Compensation takes several forms: extra cash per on-call shift paid whether or not the pager fires, comp time after a rough shift, or both. The structure varies by region and employer, and labour-law constraints differ across jurisdictions: some countries treat on-call as paid working time by default. The minimum acceptable practice is that on-call is acknowledged as work and compensated in some form.

The point is not just fairness. Compensation is a feedback mechanism. If on-call is paid, every paged shift has a cost the organisation can see, and the case for fixing noisy alerts has a number on it. Free on-call has no such feedback, which is one reason free on-call rotations stay noisy.

Burnout signals

A sustainable rotation produces few signals of distress. An unsustainable one produces several, and a manager paying attention can spot them early.

Too many alerts per shift. A shift that pages the on-call ten times a night is an emergency. A healthy data team’s on-call should be paged at most a handful of times per week, not per night, and many shifts should pass with no overnight pages at all.

Alerts during family time. A pager that consistently fires during dinner, weekends, or holidays is taxing the on-call’s life beyond what the rotation contracts for.

Disproportionate load. One person carrying twice as many incidents as the others usually means the system has a single point of human failure: a subsystem only one person understands. The fix is documentation and gradual transfer of expertise, not punishing the person currently carrying the load.

Reluctance to take the rotation. People booking vacation to coincide with their on-call weeks. People asking to swap out repeatedly. People citing on-call in exit conversations. By the time these signals are visible, the situation is already serious.

The manager’s job is to spot these signals and act. The action is rarely about the rotation itself. It is about the system driving it: which alerts fire, why, and what investments would make the platform easier to operate. On-call quality is a downstream symptom of platform quality. Fixing the symptom alone does not work.
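Two of these signals, page volume and disproportionate load, can be read straight from the paging log. The sketch below assumes a hypothetical log of `(person, shift_id)` pairs, one per page, plus the rotation roster so that people who received no pages are still counted; the "twice the median of the others" rule is one possible reading of disproportionate load, not a standard.

```python
from collections import Counter
from statistics import median


def load_report(pages, roster, ratio=2.0):
    """Summarise paging load.

    pages:  list of (person, shift_id) tuples, one per page received
    roster: everyone on the rotation, including people with zero pages
    """
    per_person = Counter({person: 0 for person in roster})
    per_person.update(person for person, _ in pages)
    per_shift = Counter(shift for _, shift in pages)

    overloaded = []
    for person, count in per_person.items():
        others = [c for p, c in per_person.items() if p != person]
        # Flag anyone carrying at least `ratio` times the median of the others.
        if others and count > 0 and count >= ratio * median(others):
            overloaded.append(person)

    return {
        "pages_per_person": dict(per_person),
        "busiest_shift_pages": max(per_shift.values(), default=0),
        "overloaded": sorted(overloaded),
    }


pages = [("Ana", "w1")] * 6 + [("Ben", "w2")] * 2 + [("Chen", "w3")] * 2
print(load_report(pages, roster=["Ana", "Ben", "Chen"])["overloaded"])  # ['Ana']
```

A name in the overloaded list is a prompt to ask which subsystem only that person understands, not a verdict on the person.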

The case for fewer alerts

If the on-call gets paged twice a night, the platform has been designed badly. The statement is uncomfortable and it is true. Healthy systems page on-call rarely. The rare pages are for real incidents.

The path from many alerts to few is the move from threshold-based to SLO-based alerting (lesson 60). Threshold-based alerting fires every time a metric crosses a value, regardless of whether the crossing matters. CPU at 80 percent, queue depth at 1,000, job runtime at 30 minutes: each is a number, none is inherently a problem. SLO-based alerting fires when the SLO is at risk: when the rate of error budget burn over a meaningful window suggests the customer-facing promise is going to break. The signals are coupled to what the team has committed to deliver, not to the internal mechanics of how it delivers them.

The shift requires investment. The team has to pick SLOs that capture what matters. The metrics have to be reliable and low-noise. The alerting rules have to fire on real budget burn, not cosmetic threshold crossings. Most teams arrive at this gradually, by retiring noisy threshold alerts as they replace them with better SLO-based ones.
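To make "real budget burn" concrete: burn rate is the observed error rate divided by the error budget, where the budget is one minus the SLO target. A burn rate of 1 spends the budget exactly over the SLO period; higher values spend it faster. The sketch below pages only when both a long and a short window burn fast, so a brief blip does not page and a recovered incident stops paging. The function names and the sample numbers are assumptions; the 14.4 threshold is the example value the Google SRE Workbook uses for a one-hour window against a 30-day SLO, not a recommendation from this lesson.

```python
def burn_rate(bad_events, total_events, slo_target):
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Each window is a (bad_events, total_events) pair.

    Page only when both windows exceed the threshold: the long window shows
    the burn is significant, the short window shows it is still happening.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


# Hypothetical SLO: 99% of pipeline runs deliver on time.
print(should_page((200, 1000), (20, 100), slo_target=0.99))  # True: sustained heavy burn
print(should_page((30, 1000), (20, 100), slo_target=0.99))   # False: short spike only
```

Compare this with a threshold alert on job runtime: the rule above never looks at how the pipeline is doing its work, only at whether the promise is being kept.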

The reward is a quieter, more honest pager. A pager that fires for SLO burn fires when something the team cares about is at risk. The trust between team and pager is restored.

The alert lifecycle

The mechanical flow from a metric being out of bounds to an incident being closed is worth seeing as one picture.

```mermaid
flowchart LR
    M[Monitoring<br/>metrics, logs, checks]
    AM[Alert manager<br/>routing, dedup, suppression]
    P[Primary on-call<br/>pager]
    A{Acknowledge}
    T{Triage}
    R[Resolve]
    E[Escalate<br/>secondary, manager, SME]
    PIR[Post-incident review<br/>lesson 62]
    M --> AM
    AM --> P
    P --> A
    A --> T
    T --> R
    T --> E
    E --> R
    R --> PIR
```

The flowchart reads left to right. Monitoring on the far left feeds the alert manager, which routes to the primary on-call’s pager. The on-call acknowledges, triages, and either resolves directly or escalates to secondary, manager, or subject-matter expert. Resolution feeds into the post-incident review process from lesson 62. The visual point is that every alert has a closed loop: it ends either in resolution or in hand-off to someone who can resolve it.

The alert manager between monitoring and pager is doing real work: deduplicating, suppressing alerts during known maintenance, routing categories to different people. A team without an alert manager will eventually drown in duplicates. The acknowledge step is how the team knows the alert was seen, and how the system stops re-paging until the situation is handled.
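The three jobs of the alert manager fit in a few lines when reduced to their core. This is a toy sketch for intuition, not a stand-in for a real alert manager: the `Alert` fields, the route table, and the rotation names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    category: str  # e.g. "pipeline", "data_quality", "infrastructure"
    resource: str  # the table, job, or cluster the alert is about


# Hypothetical routing table: alert category -> rotation to page.
ROUTES = {
    "pipeline": "primary-oncall",
    "data_quality": "primary-oncall",
    "infrastructure": "platform-oncall",
}


class AlertManager:
    def __init__(self, routes, default_route="primary-oncall"):
        self.routes = routes
        self.default_route = default_route
        self.open_alerts = set()     # (name, resource) pairs already paged
        self.in_maintenance = set()  # resources with a declared maintenance window

    def handle(self, alert):
        """Return the rotation to page, or None if the alert is dropped."""
        if alert.resource in self.in_maintenance:
            return None  # suppression: known maintenance
        key = (alert.name, alert.resource)
        if key in self.open_alerts:
            return None  # deduplication: already paged and not yet resolved
        self.open_alerts.add(key)
        return self.routes.get(alert.category, self.default_route)  # routing

    def resolve(self, alert):
        """Mark the alert as handled so a later recurrence pages again."""
        self.open_alerts.discard((alert.name, alert.resource))


manager = AlertManager(ROUTES)
failed = Alert("job_failed", "pipeline", "orders_daily")
print(manager.handle(failed))  # primary-oncall
print(manager.handle(failed))  # None (duplicate while still open)
```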

The connection to everything else in Module 8

Every other lesson in this module pays a debt to on-call. Orchestration (lesson 57) determines which jobs page when they fail. Asset-oriented thinking (lesson 58) determines whether the on-call sees a single failed asset or a cascade of confused jobs. Observability (lesson 59) determines whether the on-call finds the root cause in five minutes or fifty. SLOs (lesson 60) determine whether the pager fires for things that matter. Data quality (lesson 61) determines whether the on-call is debugging real failures or chasing false positives. Incident response (lesson 62) determines whether each incident produces learning, or the same incidents recur.

When all of those work, on-call is a manageable part of being on the team. When any one is missing, on-call is where it shows up first.

Module 8 ends here

This is the last topic lesson of Module 8. What follows is the module’s case study: how Airbnb runs its data platform. Module 9 then opens with cost optimisation, the natural sibling to operational reliability and the next big lever a mature data team has to pull.

Citations and further reading

Google SRE Book, “Being On-Call” chapter, https://sre.google/sre-book/being-on-call/ (retrieved 2026-05-01). The canonical written description of healthy on-call practice, from the team that arguably invented the modern shape of it.
Google SRE Workbook, “On-Call” chapter, https://sre.google/workbook/on-call/ (retrieved 2026-05-01). The practical companion to the SRE book, with concrete advice on rotations, hand-offs, and burnout.
PagerDuty, “Incident Response Documentation”, https://response.pagerduty.com/ (retrieved 2026-05-01). PagerDuty’s open-source incident response guide, including on-call rotation patterns and escalation playbooks.
Atlassian, “On-call best practices”, https://www.atlassian.com/incident-management/on-call (retrieved 2026-05-01). Practical guidance from a vendor whose tooling sits on this exact problem.