Real case: Uber's real-time pipelines (Marmaray, Hudi origin)

Module 6 ends the way Module 5 ended: with one company, one decade, and public engineering posts that turn the abstractions of the module into concrete machinery. Module 5 closed on Netflix, where the interesting story was a batch platform at petabyte scale. Module 6 closes on Uber, where the interesting story is the move from batch-only to streaming-first, the ingestion problem that came with it, and the Hudi project that came out of it. Both stories are about the same architectural surface (a lakehouse plus several engines plus an orchestration layer), and the difference is what drove the build: Netflix’s pressure was scale and cost; Uber’s pressure was latency.

The framing of lesson 47 sits behind this lesson. Uber did not deliberately choose Kappa over Lambda; they walked the path the industry walked, in public, with their conclusions visible in open-source projects. Marmaray and Hudi are the two artefacts that matter for this module. Marmaray is the ingestion framework. Hudi is the storage layer that made streaming-into-lake a real thing rather than a workaround.

The scale and the problem

Uber’s public talks and engineering posts give the rough shape. Tens of millions of trips per day across hundreds of countries, each trip producing dozens of events: requested, matched, started, location updates, paid, rated. Hundreds of thousands of drivers and millions of riders generating signals that have to be ingested, joined, modelled, scored, and acted on. Petabytes in the lake, thousands of services, hundreds of teams. A small set of decisions must happen in real time; a long tail can run nightly.

The Uber Engineering blog (https://www.uber.com/blog/engineering/data/, retrieved 2026-05-01) has been unusually generous with public posts about the architecture. The summary below stitches together posts about Marmaray, Hudi, the streaming platform, and the broader data stack, with citations at the end.

The pre-2014 stack was batch-only. Daily Hadoop jobs aggregated trip events for analytics, finance, and offline machine learning. That was fine for analytics, but the operational use cases (surge pricing, fraud detection, ETA models, real-time dashboards for the operations centres) needed answers in seconds or minutes. By the mid-2010s the company was bolting streaming pipelines onto the side of the batch stack, and the bolt-on was getting expensive.

Two specific pains drove the rewrites that produced Marmaray and Hudi.

Ingestion was custom per source. Data flowed from Kafka, MySQL replicas, Cassandra, schemaless internal stores, and a long tail of services. Every source had its own bespoke pipeline into Hadoop, written by whichever team had built the source first. Schemas drifted. Failures were debugged source by source. Adding a new sink meant writing a new pipeline for every source again.

The data lake could not absorb updates. Uber’s data is mostly mutable. A trip’s status changes as it progresses. A driver’s profile gets updated. A rider’s payment method is corrected. Hadoop’s append-only file model was a poor fit, and the workarounds (full daily snapshots, slowly changing dimension tables in nightly jobs) gave correct results once a day and hour-old data the rest of the time. No better than the batch-only stack they were trying to move past.

Marmaray: the ingestion framework

Marmaray was open-sourced by Uber in 2018 and described in the Uber Engineering blog post “Marmaray: An Open Source Generic Data Ingestion and Dispersal Framework for Apache Hadoop” (https://www.uber.com/blog/marmaray-hadoop-ingestion-open-source/, retrieved 2026-05-01). The pitch is in the subtitle: a generic framework for moving data between any source and any sink.

The architectural idea is decoupling. Sources and sinks are pluggable. The framework provides the surrounding scaffolding: schema management, error handling, retries, monitoring, dispersal of derived data back out to other systems. The bulk of the heavy lifting runs on Spark.

The deployments at Uber covered Kafka into Hadoop (the firehose case), Cassandra and MySQL into Hadoop (snapshots and CDC from operational stores), and the reverse path from Hadoop into operational stores when models trained in batch needed to be served in production. One framework, many source-sink pairs, with the per-pair work limited to the connectors.

The generalised lesson: the value of an ingestion framework is not the connectors. The connectors are the easy part. The value is the consistency. Every pipeline goes through the same framework, with the same observability, retries, schema management, and operational story. Adding a new pipeline becomes configuration plus a connector, not a fresh design exercise. At Uber’s scale this was the difference between hundreds of bespoke pipelines and one platform.

Hudi: storage built for streaming-into-lake

Apache Hudi started at Uber in 2016 as the answer to the upsert problem. It was open-sourced in 2017, became an Apache Incubator project, and reached top-level status in 2020. The original Uber Engineering blog post “Uber’s Big Data Platform: 100+ Petabytes with Minute Latency” (https://www.uber.com/blog/uber-big-data-platform/, retrieved 2026-05-01) lays out the motivation, and the Apache Hudi documentation (https://hudi.apache.org/, retrieved 2026-05-01) describes the format in detail.

Hudi was designed for one specific shape of workload: streams of events, many of which are updates to entities that already exist in the lake. A trip event arrives, and it might be the first event for that trip (insert), or an update to a trip that is already in flight (update), or a correction to a trip that completed yesterday (late update). The storage layer had to make all three cases efficient and atomic.

Hudi exposes three key concepts.

Copy-on-Write tables (CoW). Each upsert rewrites the data files that contain the affected rows. Reads are fast (no merge step at read time), but writes are expensive: changing one row in a hundred-megabyte Parquet file means rewriting the whole file. CoW is the right pick when reads dominate and writes are concentrated in a small number of files (recent partitions), which is the common pattern for time-series telemetry.

Merge-on-Read tables (MoR). Each upsert appends a row-level delta to a log file alongside the base Parquet. Reads merge the base and the deltas at query time. Writes are cheap (just append a delta), but reads are slower because they pay the merge cost. Periodic compaction rewrites the deltas back into the base files, restoring read performance. MoR is the right pick when writes dominate or when individual records are updated frequently across the table, as with CDC mirrors of operational databases.

Snapshots and time travel. Hudi tracks an immutable timeline of commits. Every read picks a snapshot to query against, and you can ask for the table as of a specific timestamp or commit. The same machinery powers incremental queries: “give me the rows that changed between commit T1 and commit T2”, which is the natural primitive for downstream pipelines that want to consume only what is new.

Hudi was the first open-source storage format to take all three of these features seriously together. By the time Delta Lake (2019) and Iceberg (donated 2018) were maturing, Hudi already had production deployments at Uber and was the natural pick for upsert-heavy workloads. Lesson 37 covered the format wars and where the industry landed.

The Uber pipeline shape

Putting Marmaray, Hudi, Kafka, and Flink together gives the streaming-first stack that Uber’s public posts describe. The shape is recognisably Kappa-flavoured (lesson 47), with one storage layer feeding both real-time and batch consumers.

flowchart LR
    SVC[Application services<br/>rides, payments, location]
    K[(Kafka<br/>event log)]
    F[Flink<br/>real-time aggregates]
    M[Marmaray<br/>ingestion]
    H[(Hudi tables<br/>on HDFS or S3)]
    P[Presto / Trino<br/>interactive queries]
    S[Spark<br/>batch analytics]
    OPS[Operations dashboards<br/>surge, ETA]
    BI[BI and finance]
    ML[ML training]
    SVC --> K
    K --> F
    K --> M
    F --> OPS
    M --> H
    H --> P
    H --> S
    P --> BI
    S --> ML
    S --> BI

Diagram to create: a polished version of the Mermaid diagram above, with services on the far left, Kafka as the central spine, the two consumer paths (Flink for real-time, Marmaray for ingestion) splitting off the spine, Hudi as the storage layer in the middle, the three query engines (Flink, Presto, Spark) as compute on top of Hudi, and the consumers (operational dashboards, BI, ML) on the right. The visual point is that there is one event log feeding both the real-time and the batch worlds, and one storage layer feeding multiple query engines.

A trip event flows from the application service into Kafka. From Kafka it splits two ways. Flink reads it for the real-time aggregates that power surge pricing and the operational dashboards. Marmaray reads it (along with everything else from Kafka, MySQL, and Cassandra) and lands it in Hudi tables on the data lake. Presto and Spark read from Hudi for interactive queries and batch analytics. The same Hudi tables back the dashboards, the finance reports, the ML training datasets, and the historical replay path.

Two architectural properties fall out of this shape.

One storage layer, multiple query engines. Hudi tables are the single physical store. The choice of query engine is per workload: Presto for interactive analytics, Spark for heavy ETL, Flink for streaming. Switching engines does not require migrating data. Adding a new engine is a connector plus a benchmark. This decoupling is the architectural win, and it is why the open-table-format work (lesson 37) matters so much: the format is the contract that the engines all agree on.

Streaming-first changes the lake’s role. Pre-streaming, the data lake was a read-mostly batch destination. Post-streaming, it is a transactional store with continuous writes and atomic updates. The lake’s contract changed. Hudi (and Iceberg and Delta) are what the new contract looks like in practice. The patterns that data engineers use to interact with the lake (incremental queries, snapshot reads, time travel) are the patterns that flow from this contract change.

The lessons

Five takeaways, structured the way the Module 5 Netflix case structured them.

Streaming-first changes the role of the lake. The biggest architectural shift in Uber’s journey is not “we use Kafka and Flink now”. It is that the data lake stopped being a batch destination and became a transactional store. The patterns of interacting with it changed. The format had to change to support that. The mental model had to change. This is the move from “we read yesterday’s data” to “we read continuously updated data with snapshot semantics”, and it is the move Module 6 has been preparing for. The Lambda-versus-Kappa question of lesson 47 is, at one level, a question about whether your storage layer can support both reads at once.

Open-source what you build. Uber’s investment in Hudi paid back in community contributions, broader engine support, and standardisation. The Iceberg story (lesson 40) at Netflix and the Hudi story at Uber are the same shape: a company solves a hard internal problem, open-sources the solution, and rides the wave of community improvements thereafter. The cost of open-sourcing is real; the benefit (durable software, easier hiring, ecosystem alignment) is larger at scale. Smaller teams should adopt these formats rather than invent new ones.

One storage layer, multiple engines. The decoupling between table format and query engine is the most consequential modern change in the data stack. A 2010 warehouse owned storage and engine together; a 2026 lakehouse separates them. Hudi (or Iceberg or Delta) is the storage; Spark, Presto, Flink, and a dozen others are the engines. This separation lets a single team run interactive queries, batch ETL, and streaming jobs against the same data without copying it across systems.

Build versus buy at scale. Uber built Hudi because in 2016 no off-the-shelf option supported their upsert needs against object storage. That calculation was correct for Uber. It is wrong for almost every other team, and it is increasingly wrong now that Hudi, Iceberg, and Delta exist and are mature. The general principle: build platform components only when standard ones genuinely do not fit, and accept that the threshold for “build” is much higher than it feels. The Netflix Maestro story (lesson 40) made the same point about orchestrators.

The platform absorbs the complexity so the data engineers do not. Marmaray hides ingestion complexity behind a uniform framework. Hudi hides upsert complexity behind a uniform table abstraction. The data engineers writing Spark jobs against Hudi tables do not see the merge-on-read mechanics, the compaction scheduling, or the snapshot expiration. The platform team does. The architectural goal is to make the right thing the easy thing for the application teams, by absorbing the complexity into platform components rather than asking every team to handle it themselves.

Cross-references back into Module 6

The Uber stack exercises every primitive Module 6 introduced. Kafka (lesson 42) is the durable event log, the spine of the system. Stream processors (lesson 43) are Flink, computing real-time aggregates. Watermarks and event time are how those aggregates handle the late-arriving rides and out-of-order location updates that real ride-hailing data inevitably contains. The Lambda-versus-Kappa choice (lesson 47) is implicit: Uber runs streaming and batch as separate pipelines that share the same storage, which is structurally Kappa with two consumers. Hudi (introduced in lesson 37 in the lakehouse context) is the storage that makes the whole shape work.

The patterns transfer. A team running a hundred-gigabyte Hudi table against a small Kafka cluster is doing the same thing Uber is doing on petabytes. The diagrams are the same, the operational concerns are the same, the format is the same. The constants are different. The shape is not.

Module 6 closes here, and what comes next

Module 6 walked through streaming end to end. Why streaming is structurally different from batch (lesson 41), Kafka and the durable log (lesson 42), stream processors and stateful operators (lesson 43), event time and watermarks (lesson 44), exactly-once semantics in streaming (lesson 45), CDC and the unification of operational and analytical data (lesson 46), Lambda versus Kappa as the architectural framing (lesson 47), and now Uber as the case study that exercises all of them.

Modules 5 and 6 together are the data-platform half of the course. Modules 1 through 4 covered the architectural foundations. Modules 5 and 6 covered batch and streaming, the two halves of how data moves and gets transformed at scale. The two case studies (Netflix in Module 5, Uber in Module 6) are the concrete proof that the patterns generalise.

Module 7 starts a different conversation. Code, version control, CI/CD, and the surrounding software-engineering practices that turn architectural designs into running systems. The data platform will keep showing up as an example, but the topic shifts to how engineers actually deliver software, which is the practice that all the architecture sits inside. The vocabulary changes. The discipline of “make the right thing the easy thing” does not.

Citations and further reading

Uber Engineering Blog, “Marmaray: An Open Source Generic Data Ingestion and Dispersal Framework for Apache Hadoop”, 2018, https://www.uber.com/blog/marmaray-hadoop-ingestion-open-source/ (retrieved 2026-05-01). The introduction to Marmaray, with the architecture and the use cases at Uber.
Uber Engineering Blog, “Uber’s Big Data Platform: 100+ Petabytes with Minute Latency”, 2018, https://www.uber.com/blog/uber-big-data-platform/ (retrieved 2026-05-01). The post that introduced Hudi to a wider audience and explained the streaming-into-lake motivation.
Uber Engineering Blog, data and platform posts indexed at https://www.uber.com/blog/engineering/data/ (retrieved 2026-05-01). The broader series covering Kafka deployment, the streaming platform, the real-time analytics stack, and subsequent evolutions of the architecture.
Apache Hudi documentation, https://hudi.apache.org/ (retrieved 2026-05-01). The canonical reference for the format, the table types (Copy-on-Write and Merge-on-Read), the timeline model, and the operational guidance on compaction and clustering.
“Designing Data-Intensive Applications” (Martin Kleppmann, O’Reilly, 2017), chapters 11 and 12. The standard reference for the streaming patterns that Uber’s stack instantiates.