Observability for data: logs, metrics, traces, lineage

The previous two lessons built up the orchestration story: pick a tool, prefer the asset-oriented framing where it fits, lean on managed offerings to skip the cluster-ops tax. Once the pipelines are running, the next question is the one every operator eventually asks at 03:00 on a Tuesday: what is happening, why is it happening, and which thing is responsible. That question is observability, and a data platform without good answers to it is one that occasionally produces wrong numbers, which you only find out about from a Slack message at the end of the quarter.

The web-services world standardised on three pillars of observability years ago: logs, metrics, traces. Books have been written, conferences run, vendors built. The data world inherits all three and adds a fourth that web services do not need: lineage. Knowing how a request flowed through your services is the web-services version. Knowing how a number flowed through your tables is the data version. They are different problems with different tools, and a serious data platform needs both.

This lesson walks through the four pillars, the standards (OpenTelemetry, OpenLineage) that have started to unify them, the vendors (Datadog, Honeycomb, Grafana stack), and the lineage-specific tools (Marquez, DataHub, Atlan, Monte Carlo). The goal is the picture you need before Module 8’s reliability lessons, which depend on being able to see what you are reliable about.

The three pillars (the web-services inheritance)

The three pillars framing comes from the observability community circa 2017 and has held up well. Each pillar answers a different question about a running system.

Logs are timestamped text records of events. They answer “what happened” with the most context. A log line might say “user 1234 attempted to log in at 14:32:15.412 and failed because the password did not match”, which is precise and human-readable. Logs are easy to write (every language has a logger), structured logs with key-value fields are now standard, and the tooling for searching them is mature.
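To make the "structured logs with key-value fields" point concrete, here is a minimal sketch using only Python's standard `logging` module. The logger name `auth`, the `fields` convention for passing key-value pairs, and the field names themselves are assumptions made for illustration; real projects often use a dedicated structured-logging library instead.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes record.fields.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("auth")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.warning(
    "login_failed",
    extra={"fields": {"user_id": 1234, "reason": "password_mismatch"}},
)

# The emitted line parses back into fields a log store can index.
record = json.loads(buffer.getvalue())
print(record["event"], record["user_id"], record["reason"])
```

The design point is that `user_id` and `reason` are separate fields rather than words buried in a sentence, so an indexed log store can filter or count on them without parsing free text.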

The cost of logs is volume. A busy service produces gigabytes per day. Storing them is cheap; querying them across long time ranges is not. Logs are also hard to aggregate: counting “failed login attempts in the last hour” by scanning text logs is slow if you have a billion of them. The standard tools handle this with indexed log stores: ELK (Elasticsearch, Logstash, Kibana), Datadog Logs, Splunk, Grafana Loki. Each makes different trade-offs between query speed, storage cost, and operational complexity. Loki and Splunk sit at the extremes (cheap-and-simple vs expensive-and-powerful); the rest live between them.

Metrics are numerical time-series. They answer “how much, how often, how fast”: requests per second, p99 latency, error rate, queue depth, pod CPU usage. Metrics are extremely cheap to store (each sample is a few bytes) and fast to query (time-series databases are built for this). The constraint is cardinality: each unique combination of label values (region, service, user_id) creates a separate time-series, and high-cardinality labels (anything per-user) blow up the storage and query cost. The discipline of metrics is “what dimensions matter, what dimensions do not, keep the cardinality low”.
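The cardinality arithmetic is worth doing once by hand. The sketch below computes the worst-case number of time-series for a single metric as the product of the distinct values per label. The label counts (5 regions, 40 services, 100,000 users) are invented for illustration.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time-series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label cardinalities for a single request-count metric.
low = series_count({"region": 5, "service": 40})
high = series_count({"region": 5, "service": 40, "user_id": 100_000})

print(low)   # 200
print(high)  # 20000000
```

Adding one per-user label takes the same metric from 200 series to 20 million, which is why per-user detail usually belongs in logs or traces rather than in metric labels.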

The dominant metrics tooling: Prometheus (open-source, the de facto standard for Kubernetes-native metrics), Datadog Metrics (managed, expensive, easy), CloudWatch (AWS-native, less powerful but adequate). Grafana sits on top of all of them as the visualisation layer; in 2026 most teams use Grafana even when their backend is something else.

Traces are end-to-end records of how a single request moves through a system. A request comes in to the API gateway, fans out to three services, each of which queries a database, and finally returns to the user. A trace stitches all of that together: each service’s contribution is a span, the spans link by parent-child relationships, and the resulting tree shows where time was spent.
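The span tree described above can be sketched with a small data structure. This is a toy model, not the OpenTelemetry API: the `Span` class, the service names, and the timings are all hypothetical, and the example fans out to two services rather than three to stay short. It assumes sibling spans do not overlap in time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(spans: list[Span]) -> dict[str, int]:
    """Time spent in each span excluding its direct children.

    Assumes children of the same parent do not overlap.
    """
    child_total: dict[str, int] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0) + span.duration_ms
            )
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


# One hypothetical request: gateway -> two services, each with a database query.
trace = [
    Span("a", None, "api-gateway", 0, 120),
    Span("b", "a", "orders-service", 5, 45),
    Span("c", "b", "orders-db-query", 10, 40),
    Span("d", "a", "pricing-service", 50, 115),
    Span("e", "d", "pricing-db-query", 55, 60),
]

self_times = self_time_ms(trace)
slowest = max(self_times, key=self_times.get)
print(slowest, self_times[slowest])  # pricing-service 60
```

The gateway span lasts 120 ms, but only 15 ms of that is its own work. Subtracting child durations is what turns "the request was slow" into "pricing-service spent 60 ms outside its database call".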

Traces are the answer to “where is the latency?” in a distributed system. They are also the answer to “what did this request actually do?” when the bug is not in any one service but in the interaction between several. The tooling: Jaeger (open-source, simple), Zipkin (older, still around), Datadog APM (managed, full-featured), Honeycomb (managed, with a particular focus on high-cardinality querying), Lightstep (now part of ServiceNow, similar niche), Grafana Tempo (open-source, integrates with the rest of the Grafana stack).

The pillar framing is sometimes attacked for being too rigid (logs, metrics, traces are not always cleanly separated, and modern tools blur the boundaries), but as a starting categorisation it remains useful. A team that has all three pillars covered with reasonable retention and reasonable query speeds has the basics handled.

OpenTelemetry: the standard that unifies them

For most of the 2010s, instrumenting a service meant picking a vendor and using their SDK. New Relic had its agent, Datadog had its agent, Splunk had its, Honeycomb had its. Switching vendors meant ripping out the SDK and putting in a new one. The instrumentation lock-in was real and costly.

OpenTelemetry, born from the merger of OpenCensus and OpenTracing in 2019, is the response. It is a vendor-neutral specification (and a set of SDKs) for emitting telemetry data: traces, metrics, and logs in a unified format. You instrument your code with OpenTelemetry once. The OpenTelemetry collector receives the data and forwards it to whatever backends you choose: Datadog, Honeycomb, a self-hosted Jaeger, all three at once. Switching backends is a config change in the collector, not a code change in every service.
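The "config change, not code change" claim can be illustrated with a toy model. The sketch below is not the OpenTelemetry SDK or the real collector. `ToyCollector`, the exporter names, and the record shape are hypothetical stand-ins that only show the decoupling: application code talks to one interface, and the set of backends is data.

```python
from typing import Callable

Exporter = Callable[[dict], None]


class ToyCollector:
    """Stand-in for a collector: receive telemetry once, forward to every backend."""

    def __init__(self, exporters: dict[str, Exporter]) -> None:
        self.exporters = exporters

    def receive(self, record: dict) -> None:
        for export in self.exporters.values():
            export(record)


# Lists standing in for two backends.
datadog_inbox: list[dict] = []
jaeger_inbox: list[dict] = []


def handle_request(collector: ToyCollector) -> None:
    # Application code knows only about the collector, never about a vendor.
    collector.receive({"kind": "span", "name": "GET /orders", "duration_ms": 42})


# Initial "config": a single backend.
collector = ToyCollector({"datadog": datadog_inbox.append})
handle_request(collector)

# Changed "config": add a second backend. handle_request is untouched.
collector = ToyCollector(
    {"datadog": datadog_inbox.append, "jaeger": jaeger_inbox.append}
)
handle_request(collector)

print(len(datadog_inbox), len(jaeger_inbox))  # 2 1
```

In a real deployment the exporter mapping lives in the collector's configuration file rather than in Python, but the shape of the argument is the same: the instrumented code never names a backend.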

By 2026 OpenTelemetry is the default. Every modern observability vendor accepts OTLP (the OpenTelemetry Protocol). Most cloud-native tools auto-instrument with OpenTelemetry. The lock-in problem is largely solved for new projects; legacy services with vendor-specific agents are gradually migrating.

For a data platform, this matters because the data tools (Spark, Flink, Airflow, dbt) increasingly emit OpenTelemetry-compatible telemetry. A Spark job’s job-level metrics, a Flink streaming job’s checkpoint timings, an Airflow task’s runtime: all of it can flow through the same collector and into the same backend as the rest of the application stack. The data world stops being an island.

The data-engineering twist: lineage

The three pillars cover the runtime observability story well. They do not cover the question that defines a data platform’s specific failure modes: where did this number come from?

A web service’s failure mode is “this request returned a 500” or “this request was slow”. Logs, metrics, traces tell you why.

A data platform’s failure mode is “this number is wrong”. Logs, metrics, traces tell you whether the pipeline ran. They do not tell you whether the pipeline computed the right thing, where the input came from, what other downstream tables now contain the wrong number, or who is reading those downstream tables and making decisions on them. That is the lineage problem, and it is the fourth pillar.

Lineage answers four questions that the three pillars cannot:

Which upstream tables produced this column?
Which job last touched this column, and when?
What downstream tables and dashboards depend on this column?
Who consumes this dataset?

The first two are forensic: when something is wrong, you walk backward through the lineage to find the source. The last two are operational: when you want to deprecate a table or change a column, you walk forward through the lineage to find everyone affected.

Without lineage, both directions are manual archaeology: grep the codebase, ask around in Slack, hope the original author still works at the company. With lineage, both directions are a click.
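Underneath the click, both directions are graph traversals over the same edges. The sketch below is a minimal illustration with a hand-written lineage graph. The dataset names are illustrative (`churn_score` and `clv_dashboard` in particular are made up), and a real catalogue would serve this graph from its own store.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets built directly from it.
DOWNSTREAM = {
    "raw_events": ["sessionized_events"],
    "sessionized_events": ["customer_features"],
    "customer_features": ["customer_clv", "churn_score"],
    "customer_clv": ["clv_dashboard"],
    "churn_score": [],
    "clv_dashboard": [],
}


def invert(edges: dict[str, list[str]]) -> dict[str, list[str]]:
    """Flip edge direction so the same walk can go upstream."""
    inverted: dict[str, list[str]] = {node: [] for node in edges}
    for parent, children in edges.items():
        for child in children:
            inverted.setdefault(child, []).append(parent)
    return inverted


def reachable(edges: dict[str, list[str]], start: str) -> list[str]:
    """Breadth-first walk: every node reachable from start, nearest first."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


UPSTREAM = invert(DOWNSTREAM)

# Forensic direction: where did customer_clv come from?
sources = reachable(UPSTREAM, "customer_clv")
# Operational direction: what breaks if raw_events changes?
impacted = reachable(DOWNSTREAM, "raw_events")

print(sources)
print(impacted)
```

The backward walk returns `customer_features`, `sessionized_events`, `raw_events`. The forward walk returns everything that would need checking before `raw_events` is changed or deprecated.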

The lineage tools

The lineage ecosystem in 2026 has matured around a small set of standards and players.

OpenLineage is the OpenTelemetry analogue for lineage. It is a vendor-neutral specification for emitting lineage events: “this job started”, “this job read from these tables”, “this job wrote to these tables”, “this job ended”. Tools that produce data emit OpenLineage events; tools that consume lineage (catalogues, observability platforms) receive them. By 2026 the major orchestrators (Airflow, Dagster, Prefect) and transformation tools (dbt, Spark, Flink) emit OpenLineage natively, either directly or through small adapters.
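As a rough picture of what such events look like on the wire, the sketch below builds a START and a COMPLETE event as plain JSON-serialisable dictionaries. The top-level field names (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`) follow the OpenLineage run-event model as I understand it; the namespace, job name, dataset names, and producer URL are hypothetical, and real events also carry a schema URL and optional facets that are omitted here. Treat the specification as the authority on the exact shape.

```python
import json
import uuid
from datetime import datetime, timezone


def run_event(event_type, run_id, job_name, inputs, outputs, namespace="analytics"):
    """Build a minimal OpenLineage-style run event (facets and schema URL omitted)."""

    def datasets(names):
        return [{"namespace": namespace, "name": name} for name in names]

    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": datasets(inputs),
        "outputs": datasets(outputs),
        "producer": "https://example.com/hypothetical-producer",
    }


run_id = str(uuid.uuid4())  # the same run id ties START and COMPLETE together
start = run_event("START", run_id, "build_customer_features",
                  inputs=["sessionized_events"], outputs=[])
complete = run_event("COMPLETE", run_id, "build_customer_features",
                     inputs=["sessionized_events"], outputs=["customer_features"])

print(json.dumps(complete, indent=2))
```

A lineage backend that receives both events can draw the edge from `sessionized_events` to `customer_features` without ever parsing the job's code.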

Marquez is the reference implementation of an OpenLineage backend. It is open-source, ingests OpenLineage events, and exposes a graph UI of jobs and datasets. Most shops do not run Marquez directly; they run a higher-level catalogue that uses OpenLineage as its ingestion protocol.

DataHub is the most widely deployed open-source data catalogue in 2026. It originated at LinkedIn and was open-sourced in 2020. It ingests metadata from databases, dashboards, orchestrators, and pipelines, builds a unified graph, and exposes search, lineage browsing, and ownership information. Self-hosting DataHub is non-trivial (it depends on Kafka, Elasticsearch, and MySQL, and exposes its own GraphQL API), but for shops large enough to need it, the operational cost is amortised by the platform-wide value.

Atlan, Monte Carlo, Lightup are the managed players. Atlan focuses on the catalogue and collaboration angle; Monte Carlo on data observability and anomaly detection (think: machine-learned alerts on row counts and freshness); Lightup on similar territory. The market is competitive in 2026 and the boundaries blur. The pattern is similar across them: ingest metadata and lineage from your stack, expose a UI, alert on anomalies.
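To give a feel for what a row-count alert does, here is a deliberately simple statistical stand-in: flag today's row count if it sits more than a few standard deviations from recent history. This is not how any of these vendors implement detection, and it is far cruder than the machine-learned detectors they describe. The threshold and the row counts are invented.

```python
from statistics import mean, stdev


def is_anomalous(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """Flag today's value if it is more than `threshold` standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two historical points
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold


# Hypothetical daily row counts for one table over the past week.
history = [1_002_000, 998_500, 1_001_200, 999_800, 1_000_700, 997_900, 1_003_100]

print(is_anomalous(history, 1_000_400))  # False: an ordinary day
print(is_anomalous(history, 412_000))    # True: looks like a partial load
```

Even a check this basic catches a table that arrives with less than half its usual rows. The managed tools add seasonality handling, freshness checks, and alert routing on top.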

dbt deserves a separate mention. dbt’s model graph is itself a lineage graph, and dbt has been emitting it natively since the project began. The integration with downstream catalogues (DataHub, Atlan, Monte Carlo) is well-trodden: every dbt run produces a manifest that the catalogue ingests, and the dbt models appear in the catalogue alongside the warehouse tables they materialise. For shops that have standardised on dbt for transformations, the lineage story largely writes itself; for shops that have not, the manual instrumentation cost is higher.
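To show why the manifest is enough for a catalogue to reconstruct lineage, the sketch below reads parent-to-child edges out of a trimmed, hand-written stand-in for dbt's `manifest.json`. It relies on each entry under `nodes` listing its parents in `depends_on.nodes`; the project name `shop` and the model names are hypothetical, and a real manifest contains many more keys.

```python
import json

# Hand-written, heavily trimmed stand-in for target/manifest.json.
MANIFEST_JSON = """
{
  "nodes": {
    "model.shop.stg_orders": {
      "resource_type": "model",
      "depends_on": {"nodes": ["source.shop.raw.orders"]}
    },
    "model.shop.customer_clv": {
      "resource_type": "model",
      "depends_on": {"nodes": ["model.shop.stg_orders"]}
    }
  }
}
"""


def lineage_edges(manifest: dict) -> list[tuple[str, str]]:
    """Return (parent, child) pairs for every dependency recorded in the manifest."""
    edges = []
    for node_id, node in manifest["nodes"].items():
        for parent in node.get("depends_on", {}).get("nodes", []):
            edges.append((parent, node_id))
    return sorted(edges)


edges = lineage_edges(json.loads(MANIFEST_JSON))
for parent, child in edges:
    print(f"{parent} -> {child}")
```

Because the dependency list is produced by dbt itself on every run, the catalogue never has to parse SQL to learn that `customer_clv` is built from `stg_orders`.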

A small observable data system

Pulling the four pillars together, here is what a reasonably-instrumented data system looks like in 2026:

flowchart LR
    subgraph services[Services and pipelines]
        AF[Airflow]
        DBT[dbt]
        SPARK[Spark jobs]
        FLINK[Flink streaming]
        APP[Application services]
    end

    subgraph telemetry[Telemetry pipeline]
        OTEL[OpenTelemetry collector]
        OL[OpenLineage events]
    end

    subgraph backends[Observability backends]
        LOGS[(Logs<br/>Loki/Datadog)]
        METRICS[(Metrics<br/>Prometheus)]
        TRACES[(Traces<br/>Tempo/Honeycomb)]
        CAT[(Catalogue<br/>DataHub)]
    end

    subgraph ui[UI layer]
        GRAF[Grafana]
        DH[DataHub UI]
    end

    AF -->|logs, traces, metrics| OTEL
    DBT -->|logs| OTEL
    DBT -->|OpenLineage| OL
    SPARK -->|logs, metrics| OTEL
    FLINK -->|logs, metrics, traces| OTEL
    APP -->|logs, metrics, traces| OTEL
    AF -->|OpenLineage| OL

    OTEL --> LOGS
    OTEL --> METRICS
    OTEL --> TRACES
    OL --> CAT

    LOGS --> GRAF
    METRICS --> GRAF
    TRACES --> GRAF
    CAT --> DH

The visual point of the diagram is that telemetry flows through two parallel channels (OpenTelemetry for the three runtime pillars, OpenLineage for the data lineage pillar), each lands in its own backend, and the UI layer sits on top. The standards converge the instrumentation; the backends and UIs are separate concerns.

The shape generalises. Most modern data platforms have something like this: one telemetry pipeline for runtime observability, one for lineage, both feeding into backends that the team queries through a small number of UIs.

How this comes together at 03:00

The test of an observability stack is what happens when something is wrong. Take a concrete failure. The customer_clv asset is wrong on the Tuesday morning dashboard.

The pipeline operator opens DataHub, finds the customer_clv asset, and walks backward through the lineage. The asset depends on customer_features, which depends on sessionized_events, which depends on raw_events. Two hours earlier, the raw_events ingestion job had hit a Spark error: an executor pod ran out of memory and was restarted by the cluster, but it left behind a partial write that the downstream computations consumed without realising it was incomplete.

That’s all four pillars at work. Lineage pointed at the right upstream job. Logs (from the Spark driver pod) showed the executor failure. Metrics (executor memory usage over time) confirmed it was OOM. Traces (the trace of the orchestrator’s task that ran the job) showed how the failure surfaced: a downstream task succeeded when it should have failed. Without any one of them, the diagnosis takes longer; with all four, it takes minutes.

This is the “you can’t manage what you can’t see” thesis in operational terms. Observability is not a luxury you bolt on after the platform is running; it is the precondition for a platform that anyone wants to be on call for. A pipeline you cannot see is a pipeline that breaks silently and gets caught by users instead of by operators.

What this lesson sets up for the rest of Module 8

The next lessons in Module 8 build on this foundation. Reliability practices (SLOs for data, on-call rotations, postmortems, incident response) all assume the team can see what is happening. Without observability, an SLO is aspirational; with it, the SLO is a number you can measure and a graph you can show stakeholders.

The same is true of the data-quality story (which sits adjacent to but is not exactly observability): tools like Great Expectations, dbt tests, and Monte Carlo’s anomaly detection assume there is a place to send the test results and an alert path that surfaces failures. The observability stack is that place and that path.

For 2026 the pragmatic recommendations: instrument with OpenTelemetry from day one. Pick a logs/metrics/traces backend that fits your scale (Grafana stack for cost-conscious teams, Datadog or Honeycomb for teams willing to pay for ergonomics, the cloud provider’s native stack for teams already deep in one cloud). Emit OpenLineage from your orchestrator and your transformation tools. Pick a catalogue (DataHub if you self-host, Atlan or Monte Carlo if you want managed). Connect them. Expect that the connection work is more annoying than the documentation suggests, and that the payoff arrives the first time someone asks “where did this number come from” and gets an answer in thirty seconds instead of three days.

Citations and further reading

OpenTelemetry documentation, https://opentelemetry.io/docs/ (retrieved 2026-05-01). The vendor-neutral standard for traces, metrics, and (increasingly) logs. The “concepts” section is the right starting point.
OpenLineage documentation, https://openlineage.io/docs/ (retrieved 2026-05-01). The lineage analogue, with a list of producers (Airflow, Spark, dbt, Dagster, Flink) and consumers (Marquez, DataHub).
Cindy Sridharan, “Distributed Systems Observability” (O’Reilly, 2018). The book that crystallised the three-pillars framing for the wider community.
Charity Majors, Liz Fong-Jones, George Miranda, “Observability Engineering” (O’Reilly, 2022). Honeycomb’s perspective on observability as a discipline, with the anti-three-pillars argument that is worth reading for balance.
DataHub documentation, https://datahubproject.io/docs/ (retrieved 2026-05-01). The catalogue that most self-hosted lineage stacks land on by 2026.
Datadog and Honeycomb engineering blogs, https://www.datadoghq.com/blog/ and https://www.honeycomb.io/blog/ (retrieved 2026-05-01). Useful for the vendor-perspective view on where observability is heading and how OpenTelemetry is reshaping the market.