Backfilling and replay · Narcis Miclaus

Picture the meeting. Finance has been reconciling a quarterly number against an external source and the two do not match. Investigation traces the gap to a transformation in your daily revenue pipeline that has been wrong for six months, ever since a refactor nobody flagged. Every downstream table built on top of that transformation has been wrong for six months. Dashboards, exec reports, ML training data, the numbers the board saw last quarter.

Now the question lands on you. How do we make it right.

This is the backfill problem, and every batch system above a trivial size eventually meets it. The previous lesson was about idempotency, which is what makes backfilling possible in the first place. This lesson is about the patterns that turn a backfill from a multi-week catastrophe into a Tuesday afternoon.

Why six months of “just rerun the job” is hard

The naive plan is “rerun the daily job for each of the past 180 days, in order, and the tables will be right”. The naive plan does not survive contact with reality, for four distinct reasons.

Cluster cost and wall-clock time. A daily job that takes four hours needs 720 hours of compute to rerun 180 days serially. That is a month of wall-clock time on a cluster that also has to run tonight’s job. If you parallelise, you need 180 times the cluster capacity for the duration. Either dimension hurts. The backfill is a workload on top of the steady-state workload, and the cluster has to absorb both.

Source-data preservation. The transformation reads inputs. Are those inputs still there? If the source is a transactional database that purges old rows after 90 days, the inputs from six months ago are gone, and rerunning against present-day source data does not reproduce the historical output. If the source is a log archived to object storage, you are fine. If the source is “yesterday’s snapshot of a table that gets overwritten daily”, you are not.

Idempotency. The backfill rewrites partitions that already have data. An appending job duplicates on rerun; a partition-overwrite job is safe. Lesson 38 covered this in detail. The relevance here is that a non-idempotent job cannot be backfilled without manual intervention, and a “mostly idempotent” job is a trap because the failure modes only show up under the unusual conditions of a backfill.

Downstream impact. The corrected table is read by dashboards, ML pipelines, exports, regulators. Some consumers cache; some are batch jobs that will see new data appear in historical partitions and become confused; some will silently use the old (still wrong) numbers because the last refresh happened before the backfill. The blast radius of “I am rewriting six months of a table” is larger than the table.

A backfill, in other words, is not a re-execution of a job. It is a small migration project, with a plan, a rollback story, and a communication thread.

The patterns that make backfills routine

Six patterns, each independently useful, that compound when you have all of them.

Keep raw data forever, or close to it. Object storage in 2026 costs roughly 0.02 USD per GB per month for cold storage. A petabyte of raw events for a year is on the order of a quarter of a million dollars annually. That is real money, and it is far less than the engineering cost of being unable to backfill when you need to. The Lambda architecture, articulated by Nathan Marz around 2011, made this explicit: a batch layer that recomputes derived results from an immutable raw log on demand. Keep the raw data, in its raw shape, in cheap immutable storage, with a retention policy long enough to backfill from. Whatever you derived was computed from something; keep the something.

Idempotent jobs. Every batch job should be safely re-runnable. Lesson 38 gave the patterns: partition-overwrite semantics, deterministic outputs, no in-place mutation. A job that is idempotent for tonight’s normal run is idempotent for a backfill of any historical day. A job that is not idempotent forces a manual cleanup step every time you backfill, which means in practice that you do not backfill, you write apologies.

Parallelisable backfills. The 180 days of the backfill are independent. The transformation for 2025-09-15 does not need 2025-09-14 to have run. (If it does, your job has hidden state across days; fix that first.) When days are independent, the backfill runs them in parallel, bounded only by cluster capacity and source-side rate limits. A 180-day backfill on a cluster sized for 30 parallel jobs finishes in 6 batch waves, not 180.

Versioned outputs. Do not write the corrected data on top of the wrong data in place. Write to a new version of the table. Some teams use a date-suffixed name (fct_revenue_v2_20260123), some an Iceberg or Delta snapshot, some a partition-level swap. The shape is the same: the new table sits next to the old, you validate, and you cut over atomically. The old version stays around for a retention window in case you need to roll back.

Watermark resets. Most batch pipelines run on a high-water-mark model: the job remembers the last partition it processed. To backfill, you reset the watermark for the affected window. This sounds trivial and is not, because the watermark is often spread across multiple systems (the orchestrator, the job’s own state, downstream consumers’ watermarks). A backfill plan needs a list of watermarks to reset and a list to advance again at the end.

Pause or design for downstream tolerance. Either you pause the downstream jobs that read the table you are rewriting, or you design them to tolerate updates to historical partitions. Pausing is operationally simple and politically expensive. Tolerance is architecturally harder and more durable: the downstream job is itself idempotent, picks up the new data on its next pass, and reports its own completeness so consumers know when to trust the answer. If you expect to backfill regularly, build for tolerance.

Replay, the streaming cousin

Replay is the streaming-system version of backfilling. Vocabulary worth introducing here, even though Module 6 is where it lives.

In a streaming system, the input is an event log (Kafka, Kinesis, Pulsar) with a retention window, and consumers read from offsets in that log. To replay is to reset a consumer’s offset to a historical position and re-process from there. If retention is seven days and your bug was introduced six days ago, replay is cheap: rewind the consumer, let it re-process, the corrected output appears in real time. If the bug was introduced a year ago, replay from the log alone is impossible; you need the batch layer to recompute from the durable raw archive.

Kafka makes replay structurally cheap because the log is the durable storage, consumers are just bookmarks, and rewinding a bookmark is one API call. The original Lambda architecture treated batch and stream as two paths to the same answer; the modern variant (sometimes called Kappa) collapses the two by giving the streaming layer enough retention that replay covers most cases batch was needed for. Module 6 walks through the trade-offs.

The relevant point here: replay is backfill in a different vocabulary. The patterns transfer. Keep the raw events. Make consumers idempotent. Version the outputs. Coordinate downstream.

Walking through a real plan

Concrete example. The daily revenue report has a bug in the FX-rate join. For 90 days, transactions in non-USD currencies have been converted using the wrong rate (the rate for the previous day, not the rate for the transaction day). The downstream tables are fct_daily_revenue (90 affected daily partitions) and three reports built on top of it (executive dashboard, finance reconciliation export, monthly board pack).

The plan, in the order it executes:

Freeze the affected reports. Notify the consumers (finance, exec ops, board prep) that the dashboard will read “as of yesterday” for the next 48 hours while a correction is in flight. Add a banner on the dashboard.
Verify the source data is intact. The FX rate table goes back two years, so the historical rates are available. The transaction table is partitioned by day and retained for three years, so the inputs for the affected 90 days are still there. Good.
Reset the watermark for fct_daily_revenue. The orchestrator stores the high-water-mark in a metadata table; we set it back 90 days for the affected job.
Backfill into a versioned table. The job writes to fct_daily_revenue__backfill_20260123 instead of fct_daily_revenue. The cluster is sized to run 15 days in parallel, so the 90-day backfill takes 6 batch waves, roughly 6 hours of wall clock.
Validate. A reconciliation job compares totals between the old and new tables for each affected day. Most differences should be small (the FX rate fix), and the sign and magnitude should match what we expect from the bug analysis. We pull a random sample of transactions and verify the corrected rates by hand against the FX source.
Cut over. A single atomic operation swaps the affected partitions of fct_daily_revenue to point at the new data. In Iceberg or Delta this is a metadata swap, not a data move. The old data stays on disk for the rollback window.
Trigger downstream rebuilds. The reports read from the corrected table; they are themselves idempotent and rebuild for the affected window.
Communicate. A note to finance, exec ops, and the board-prep team summarising the issue, the fix, the affected period, and the magnitude of the correction. Without this step, the engineering work was technically correct and organisationally invisible, which is the same as not having done it.
Decommission. After a retention window (say, two weeks) the old __backfill_ table is dropped.

Diagram to create: a horizontal flowchart showing the nine steps above as a swim lane. Top lane: data engineering actions (freeze, reset watermark, backfill, cut over, decommission). Bottom lane: communication actions (notify consumers, validate with finance, send summary). Arrows between lanes show the coordination points.

flowchart LR
    A[Freeze reports] --> B[Verify source data]
    B --> C[Reset watermark]
    C --> D[Backfill to versioned table]
    D --> E[Validate against source]
    E --> F[Atomic cutover]
    F --> G[Rebuild downstream]
    G --> H[Communicate]
    H --> I[Decommission old version]

The plan is not exotic. It is the same playbook every team that runs serious batch infrastructure converges on, because the alternatives (in-place rewrites, manual SQL fixes, “we’ll just be more careful next time”) have all been tried and have all failed.

The cost-benefit of “store raw forever”

The single most consequential architectural choice for backfilling is whether you keep the raw inputs. Teams that lose this fight do so for predictable reasons: storage seems expensive in the abstract, compliance pushes toward shorter retention, and nobody wants to defend a year of raw event data at the bill review.

The counter-argument is the bug you have not discovered yet. Every team I have seen that operates serious batch infrastructure has had at least one moment where the difference between “we can fix this” and “we have to write an apology to the regulator” was whether the raw data was still around. Object storage at 2026 prices is the cheapest insurance the data platform sells.

The legal and privacy carve-outs are real and need a plan. PII has retention obligations shorter than what you would want for backfilling; the answer is to keep the raw event log with PII fields tokenised or tombstoned after the legal retention window, while keeping the non-PII parts longer. More work than “keep everything forever”, but it preserves backfill capability for the bulk of the data while honouring the rules for the sensitive parts.

Module 5 closes here, almost

The next lesson is the case study that puts all of Module 5 into motion at petabyte scale. Netflix runs one of the largest and most public batch data platforms in the industry, and reading their architecture against the abstractions of this module is the cleanest way to see how the pieces fit when the stakes are real and the data is enormous. After that, Module 6 starts on streaming.

Citations and further reading

Nathan Marz and James Warren, “Big Data: Principles and best practices of scalable real-time data systems”, Manning, 2015. The book that articulated the Lambda architecture and the “keep the raw log forever” principle.
Jay Kreps, “Questioning the Lambda Architecture”, O’Reilly Radar, 2014, https://www.oreilly.com/radar/questioning-the-lambda-architecture/ (retrieved 2026-05-01). The Kappa-architecture rebuttal that argues a long-retention stream layer subsumes the batch layer.
Apache Iceberg documentation, “Maintenance and snapshot expiration”, https://iceberg.apache.org/docs/latest/maintenance/ (retrieved 2026-05-01). The mechanics of versioned tables and atomic snapshot swaps that make safe backfills possible.
“Designing Data-Intensive Applications” (Martin Kleppmann, O’Reilly, 2017), chapter 10. The standard reference for batch processing, idempotency, and the operational realities that this lesson summarises.