Data & System Architecture, from the ground up Lesson 58 / 80

Asset-oriented orchestration (Dagster's lesson)

Modeling tables and files as first-class objects. Why this approach pays off at scale and what it changes about how teams think about pipelines.

The previous lesson surveyed the four big orchestrators and noted that one of them (Dagster) makes a different choice about what an orchestrator is for. That choice is worth its own lesson, because it is the most interesting architectural shift in data orchestration since Airflow itself. The shift has a name (asset-oriented orchestration), a concrete tool that demonstrates it (Dagster, increasingly Prefect, and in pieces other tools too), and consequences that play out in how teams talk about their pipelines.

This lesson is the theory. It explains the traditional task-oriented model, the asset-oriented alternative, what changes when you adopt the asset-oriented model, and where the costs land. The argument is not that asset-oriented orchestration is universally superior. It is that at sufficient scale and complexity, the task-oriented model starts asking the wrong questions, and the asset-oriented model starts answering the right ones.

The task-oriented model

Airflow defined the orthodoxy. A pipeline is a directed acyclic graph of tasks. Each task is a unit of work: a Python function, a SQL query, a shell command, a container run. The orchestrator schedules tasks, tracks their state, and enforces dependencies between them. The data those tasks produce is implicit. The orchestrator does not know what the task wrote, where it wrote it, or who downstream depends on it.

A typical Airflow DAG might look like this:

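from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path

# The four python_callable functions are placeholders, assumed to be defined elsewhere.
with DAG("customer_pipeline", schedule="@daily", start_date=datetime(2026, 1, 1), catchup=False) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    transform = PythonOperator(task_id="transform_orders", python_callable=transform_orders)
    load = PythonOperator(task_id="load_to_warehouse", python_callable=load_to_warehouse)
    customer_metrics = PythonOperator(task_id="customer_metrics", python_callable=compute_customer_metrics)
    extract >> transform >> load >> customer_metrics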

The orchestrator knows: there are four tasks, here is their order, here is when to run the DAG. The orchestrator does not know: extract produces an orders staging table, transform reads it and writes a cleaned table, load writes the warehouse fact table, customer_metrics reads the fact table and writes a daily metrics table. All of that data lives outside the orchestrator’s awareness, in S3 paths and warehouse tables that exist only because the tasks happened to write them.

This is fine when the system is small. Four tasks fit in your head; the data and the orchestration align in the heads of the engineers who built them. The model breaks at scale, in three places at once.

The dependency question gets hard. When you have hundreds of DAGs producing thousands of tables, asking “what depends on the customer_clv table?” requires reading task code or grepping the codebase. The orchestrator cannot answer the question, because it does not know about tables; it only knows about tasks.

The freshness question gets hard. Asking “is the customer_clv table fresh?” requires knowing which task last wrote it and when, then translating that into “the table is X hours old”. The orchestrator can tell you when the customer_clv task last succeeded. It cannot tell you when the table was last updated, because the writes are a side effect that the orchestrator did not observe.

The retry question gets coarse. When a downstream task fails, the natural question is “what assets need to be regenerated?” The task-oriented orchestrator cannot answer that, so it offers either “rerun this task” or “rerun the whole DAG”. Neither is precise.

These pains are tolerable at small scale and increasingly intolerable as the platform grows.

The asset-oriented model

The asset-oriented model inverts the relationship. Instead of declaring tasks and letting the data fall out as a side effect, you declare assets, which are the things the pipeline produces (tables, files, model artefacts), and the orchestrator derives the work from the asset graph.

A Dagster equivalent of the pipeline above:

from dagster import asset

@asset
def orders_raw():
    return extract_from_source()

@asset
def orders_cleaned(orders_raw):
    return clean(orders_raw)

@asset
def orders_warehouse(orders_cleaned):
    return load_to_warehouse(orders_cleaned)

@asset
def customer_metrics(orders_warehouse):
    return compute_metrics(orders_warehouse)

Each function declares an asset. The function arguments declare the dependencies (customer_metrics depends on orders_warehouse, which depends on orders_cleaned, and so on). The orchestrator reads this graph and knows: there are four assets, this is their dependency structure, here is what each one looks like, here is when each was last materialised.
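How a graph can be read off function signatures can be sketched in plain Python. This is a toy illustration under stated assumptions, not Dagster's implementation: the `asset` decorator, the `ASSETS` registry, and `materialize_all` below are hypothetical names, and the asset bodies are stand-ins for real extraction and cleaning logic.

```python
import inspect
from graphlib import TopologicalSorter

# Hypothetical registry: asset name -> (function, names of upstream assets).
ASSETS = {}

def asset(fn):
    """Register fn as an asset; its parameter names are its upstream assets."""
    deps = list(inspect.signature(fn).parameters)
    ASSETS[fn.__name__] = (fn, deps)
    return fn

@asset
def orders_raw():
    # Stand-in for a real extraction step.
    return [{"id": 1, "amount": "10"}, {"id": 2, "amount": None}]

@asset
def orders_cleaned(orders_raw):
    return [row for row in orders_raw if row["amount"] is not None]

@asset
def customer_metrics(orders_cleaned):
    return {"order_count": len(orders_cleaned)}

def dependency_graph():
    """Return {asset: set of upstream assets}, derived from the registry."""
    return {name: set(deps) for name, (_, deps) in ASSETS.items()}

def materialize_all():
    """Run every asset once, upstream first, passing results downstream."""
    results = {}
    for name in TopologicalSorter(dependency_graph()).static_order():
        fn, deps = ASSETS[name]
        results[name] = fn(*(results[dep] for dep in deps))
    return results

graph = dependency_graph()
results = materialize_all()
```

The point of the sketch is that the graph is a by-product of declaring the assets: nothing beyond the function signatures had to be written to tell the runner what depends on what.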

The shift in framing is subtle and consequential. The team stops talking about “the customer_metrics task” and starts talking about “the customer_metrics asset”. The orchestrator stops being a job runner and starts being a data catalogue with execution attached. The questions you can ask change. “When was customer_metrics last refreshed?” is now a first-class query. “What depends on orders_warehouse?” is a first-class query. “customer_metrics needs to be fresh every six hours; figure out what to run to keep it fresh” is a first-class scheduling primitive (Dagster has called this a freshness policy, though the exact API has changed across releases).
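The freshness idea can be made concrete with a small sketch. It assumes a deliberately simple rule that is not Dagster's actual semantics: an asset must be rerun if its last materialisation is older than the allowed lag, or if anything upstream of it is being rerun. The names `UPSTREAM`, `ancestors`, and `refresh_plan`, and the timestamps, are hypothetical.

```python
from datetime import datetime, timedelta
from graphlib import TopologicalSorter

# asset -> upstream assets (the four-asset pipeline from the example above).
UPSTREAM = {
    "orders_raw": set(),
    "orders_cleaned": {"orders_raw"},
    "orders_warehouse": {"orders_cleaned"},
    "customer_metrics": {"orders_warehouse"},
}

def ancestors(target):
    """All assets that target depends on, directly or transitively."""
    seen, stack = set(), [target]
    while stack:
        for up in UPSTREAM[stack.pop()]:
            if up not in seen:
                seen.add(up)
                stack.append(up)
    return seen

def refresh_plan(target, max_lag, last_materialized, now):
    """Return the assets to run, upstream first, so target is within max_lag."""
    cutoff = now - max_lag
    relevant = ancestors(target) | {target}
    subgraph = {a: UPSTREAM[a] & relevant for a in relevant}
    plan = []
    for name in TopologicalSorter(subgraph).static_order():
        too_old = last_materialized[name] < cutoff
        upstream_rerun = any(up in plan for up in UPSTREAM[name])
        if too_old or upstream_rerun:
            plan.append(name)
    return plan

now = datetime(2026, 1, 1, 12, 0)
last_materialized = {
    "orders_raw": now - timedelta(hours=2),        # recent enough
    "orders_cleaned": now - timedelta(hours=7),    # older than six hours
    "orders_warehouse": now - timedelta(hours=7),
    "customer_metrics": now - timedelta(hours=7),
}
plan = refresh_plan("customer_metrics", timedelta(hours=6), last_materialized, now)
print(plan)  # ['orders_cleaned', 'orders_warehouse', 'customer_metrics']
```

The scheduler is asked for an outcome ("keep this asset within six hours") and works backwards through the graph to the runs that achieve it, which a task-only model has no data to do.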

A worked example

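Take a customer-data pipeline whose main chain produces, in order: a raw events table from the CDC stream, a sessionised events table aggregated to a session grain, customer features for the ML team, and a customer lifetime value model artefact. Two marts branch off that chain: a daily metrics mart built from the sessionised events, and a customer segments table built from the customer features.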

The asset graph:

flowchart LR
    RE[raw_events] --> SE[sessionized_events]
    SE --> CF[customer_features]
    CF --> CLV[customer_clv_model]
    SE --> DM[daily_metrics_mart]
    CF --> CS[customer_segments]

Diagram to create: a polished version of the asset graph above, organised into three layers (raw, intermediate, mart). The visual point is that the graph is a graph of data, not a graph of jobs. Each box is a thing that exists in storage; each arrow is a dependency that lets you trace where data came from.

What the orchestrator now knows, that the task-oriented orchestrator did not:

  • The graph is a lineage graph. If raw_events is wrong, the orchestrator can compute exactly which downstream assets are stale: sessionized_events, customer_features, customer_clv_model, daily_metrics_mart, customer_segments.
  • Each asset has a last-materialised timestamp. The team can see “customer_features was last refreshed three hours ago” without reading task logs.
  • Each asset has a freshness policy. “customer_features must be fresh every six hours; if not, surface an alert.” The orchestrator runs whatever upstream assets are needed to satisfy the policy, in dependency order.
  • Retry is precise. If customer_features failed, the orchestrator knows that customer_clv_model and customer_segments are now stale and can be re-materialised independently of daily_metrics_mart, which is unaffected.
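The lineage and retry claims above reduce to a graph traversal. The sketch below encodes the edges of the asset graph from this example and walks them breadth-first; `DOWNSTREAM` and `stale_downstream` are illustrative names, not part of any orchestrator's API.

```python
from collections import deque

# Edges from the asset graph above: upstream asset -> direct downstream assets.
DOWNSTREAM = {
    "raw_events": ["sessionized_events"],
    "sessionized_events": ["customer_features", "daily_metrics_mart"],
    "customer_features": ["customer_clv_model", "customer_segments"],
    "customer_clv_model": [],
    "daily_metrics_mart": [],
    "customer_segments": [],
}

def stale_downstream(asset):
    """Breadth-first walk returning every asset transitively downstream."""
    stale, queue = [], deque(DOWNSTREAM[asset])
    while queue:
        current = queue.popleft()
        if current not in stale:
            stale.append(current)
            queue.extend(DOWNSTREAM[current])
    return stale

print(stale_downstream("raw_events"))
# ['sessionized_events', 'customer_features', 'daily_metrics_mart',
#  'customer_clv_model', 'customer_segments']
print(stale_downstream("customer_features"))
# ['customer_clv_model', 'customer_segments']
```

Note that daily_metrics_mart does not appear in the second result: a failure in customer_features leaves it untouched, which is exactly the precision the retry bullet describes.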

None of this is theoretically impossible in Airflow; it is, however, layered on top of a model that does not natively understand any of it. In Dagster it is the model.

The benefits

Four wins fall out of the asset-oriented framing.

Lineage for free. The asset graph IS the lineage graph. Lineage in a task-oriented system is something you generate by parsing SQL or by writing OpenLineage emitters that translate task events into asset events. In an asset-oriented system, lineage is the data structure the orchestrator already maintains. The catalogue UI shows it natively. Cross-team questions (“who reads the orders table?”) become a click instead of a code-archaeology project.

Smarter retries. When a failure or a code change invalidates an asset, the orchestrator can compute the precise downstream blast radius and re-materialise only what is affected. In task-oriented systems this either does not happen (you rerun the whole DAG) or requires manual intervention (an engineer figures out what to rerun). At scale, this distinction is the difference between minutes and hours of repair work.

A better mental model. Teams talk about data products, not jobs. A data product is the customer_clv asset, not the customer_clv_dag job that runs it. New engineers learn the system by reading the asset graph, which describes what exists; in task-oriented systems they learn by reading DAG files, which describe what runs. The first is shorter and more useful.

dbt integration becomes natural. dbt models are assets. The dbt project’s directed graph of models is itself an asset graph, with dependencies declared via the ref() function. Dagster (and increasingly other tools) reads the dbt project and integrates it directly into the asset graph: dbt models appear alongside Python-defined assets, lineage flows across the boundary, and the orchestrator and dbt agree on the same vocabulary. In a task-oriented system, the entire dbt run is one opaque task, and the lineage inside dbt is a separate concern that has to be reconstructed.
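Why a dbt project already is an asset graph can be shown with a rough sketch that scans model SQL for ref() calls. The model names and SQL below are hypothetical, and the regular expression only handles the single-argument form of ref(); real integrations typically read dbt's compiled manifest rather than scanning SQL text, so treat this as an illustration of the idea only.

```python
import re

# Matches {{ ref('model_name') }} or {{ ref("model_name") }} in dbt SQL.
# Assumption: only the single-argument form of ref() is handled.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")

# Hypothetical dbt models: model name -> SQL text.
MODELS = {
    "stg_orders": "select * from {{ source('shop', 'orders') }}",
    "orders_cleaned": (
        "select * from {{ ref('stg_orders') }} where amount is not null"
    ),
    "customer_metrics": (
        "select customer_id, count(*) as order_count "
        "from {{ ref(\"orders_cleaned\") }} group by 1"
    ),
}

def dbt_asset_graph(models):
    """Return {model: set of upstream models} from the ref() calls in each model."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}

graph = dbt_asset_graph(MODELS)
for model, upstream in graph.items():
    print(model, "<-", sorted(upstream))
```

The resulting dictionary has the same shape as a graph derived from Python-defined assets, which is why the two can be merged into one lineage view.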

The costs

The asset-oriented model is not free. Three real costs apply.

Conceptual buy-in. Engineers used to thinking in tasks and DAGs have to relearn the model. “What is this DAG?” becomes “what is this asset?” “When does this run?” becomes “when does this need to be fresh, and what produces it?” The transition takes a team weeks of practice and is a real source of friction. Junior engineers who learn the asset-oriented model from day one do not feel this; engineers with five years of Airflow muscle memory do.

Tool lock-in. Dagster is the most committed asset-oriented tool. Prefect 3.x has asset support but the model is less central. Airflow has nibbled around the edges with the Datasets API (introduced in 2.4, expanded since, and renamed Assets in Airflow 3.0), but it is bolted onto a task-oriented core. Argo does not have it at all. Once a team commits to asset-oriented orchestration, switching tools is a real migration, not a config change.

Migration cost from existing systems. A shop with a hundred Airflow DAGs that wants to move to Dagster is signing up for serious work. The DAGs do not translate mechanically; the asset boundaries have to be rethought, the freshness policies designed, the dbt integration set up. Most teams who make this move do it incrementally, asset by asset, over months. Big-bang migrations rarely work.

These costs are why asset-oriented orchestration is not the default for everyone. It is a choice, and like every architectural choice, it is worth making deliberately.

Why this matters more at scale

The asset-oriented model’s advantage scales with the size of the system.

For a small team with a dozen pipelines and fifty tables, the dependency questions are answerable by reading the codebase. The freshness questions are answerable by checking task logs. The retry questions are answerable by rerunning the affected DAG. The asset-oriented framing is a nice convenience, not a transformative win.

For a platform team with a thousand pipelines and ten thousand tables, the dependency questions are intractable by hand. “What does the customer_clv table depend on transitively, and what depends on it?” is the kind of question that needs to be answered in a UI by someone who is not the pipeline’s author. The freshness questions require dashboards. The retry questions require precision, because rerunning a thousand DAGs because one upstream change invalidated some leaf assets is not a reasonable strategy.

At that scale, the orchestrator that knows about assets is solving the right problem. The orchestrator that knows only about tasks is solving the wrong one and forcing the team to bolt on lineage, freshness tracking, and precise retries as separate systems.

This is the architectural argument for asset-oriented orchestration: as the data platform grows, the question that dominates is “what depends on what”. The asset-oriented orchestrator answers that question by construction. The task-oriented orchestrator answers it eventually, by accumulation of separate tools that you wire together.

Where this is going

The lesson here is not just about Dagster. The asset-oriented framing is a movement, and other tools are absorbing pieces of it. Airflow’s Datasets API, Prefect’s asset emitters, the OpenLineage standard (which lesson 59 covers in the observability context), and the broader convergence of orchestration with data catalogues all point in the same direction. The “task” was always a means; the “asset” was always the end. The tools are catching up to that.

For new platforms in 2026, the asset-oriented model is the default starting point unless there is a good reason otherwise (existing Airflow infrastructure, ML-heavy workloads better served by Argo, a strong organisational preference). For legacy platforms, the question is whether the migration cost pays back, which it usually does for shops large enough to feel the pain of the task-oriented model and not for shops small enough that the pain is theoretical.

The next lesson stays in the same neighbourhood: observability for data, where the asset-oriented framing meets the lineage tools (Marquez, OpenLineage, DataHub) that turn the asset graph into something the rest of the organisation can navigate. The thread connecting both lessons: a data platform that you cannot see is a data platform you cannot operate, and “see” means more than task status. It means knowing what you have, where it came from, when it was last touched, and who is reading it. The asset-oriented orchestrator is the half of that picture that lives where the work happens; the lineage and observability tooling is the half that lives where the rest of the organisation looks.

Citations and further reading

  • Dagster documentation, “Assets” section, https://docs.dagster.io/concepts/assets/software-defined-assets (retrieved 2026-05-01). The canonical reference for the software-defined asset model.
  • Nick Schrock, “Introducing Software-Defined Assets” (Dagster blog, 2022). The original framing post, which makes the case for assets as the unit of orchestration.
  • Apache Airflow documentation, “Datasets” section, https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html (retrieved 2026-05-01). Airflow’s response to the asset-oriented framing, useful as a comparison point.
  • “Data Pipelines Pocket Reference” (James Densmore, O’Reilly, 2021) and “Fundamentals of Data Engineering” (Joe Reis and Matt Housley, O’Reilly, 2022). Both treat orchestration in context and discuss the data-product framing that asset-oriented orchestration formalises.