Architecture
From single-server apps to global distributed systems. Storage choices, replication, streaming, orchestration, cost, and the case studies that show how it's done at scale.
Interactive
Reference architectures
Eight click-through blueprints with animated data flow: starter SaaS, the 80% standard, AI-augmented SaaS, Excel-to-BI, medallion lakehouse, event-driven, multi-region, data-team toolchain.
Open the gallery →
Essay
The 80% architecture
The default modern SaaS stack that fits most companies. What it is, what it costs, why it's standard, and when to deviate.
Read the essay →
-
Lesson 1
What software architecture actually is
Definitions that aren't useless. Architecture as the set of decisions that are expensive to change later. The 'if it's hard to change, it's architecture' rule.
-
Lesson 2
Functional vs non-functional requirements
The trap of focusing on features. The qualities that drive architecture: latency, throughput, availability, durability, consistency, security, evolvability.
-
Lesson 3
The C4 model: context, container, component, code
A four-zoom-level diagramming convention that fits actual conversations about systems.
-
Lesson 4
Architectural Decision Records (ADRs)
Capturing decisions with their alternatives and consequences. The Michael Nygard format that most teams have converged on.
-
Lesson 5
Trade-offs are everything
Latency vs throughput, consistency vs availability, simple vs flexible. The named trade-off catalogue and why 'we want it all' is the most expensive request in the room.
-
Lesson 6
The first architecture: a single-server web app + database
What a working system looks like at startup scale: one VM, one Postgres, one process, and why that's an entirely respectable place to start.
-
Lesson 7
When the first architecture stops being enough
The symptoms that say the single-server app has reached its ceiling. Database contention, deploy-causes-outage, the daily backup that takes longer than a day.
-
Lesson 8
Three case studies of 'we should have started simpler'
Stripe staying on Postgres far longer than people expected, Shopify's monolith stance, and Basecamp's majestic monolith. The case for not over-engineering early.
-
Lesson 9
Why distributed systems are hard: the 8 fallacies
Peter Deutsch's list, restated for 2026. The network is reliable, latency is zero, bandwidth is infinite, and how each fallacy ruins someone's week.
-
Lesson 10
CAP theorem, in practice
What CAP actually says, what it doesn't say, and why 'AP system' is half a sentence. Real examples: DNS, bank ledgers, Cassandra.
-
Lesson 11
PACELC: what CAP missed
Daniel Abadi's extension. Even in the absence of partitions, you trade latency for consistency.
-
Lesson 12
Consistency models: strong, eventual, causal, monotonic
The spectrum of guarantees a system can offer, with a worked example showing what each model promises and breaks.
-
Lesson 13
Time in distributed systems: clocks, ordering, vector clocks
Physical time is a lie. Lamport timestamps, vector clocks, Google's TrueTime, and why 'when did this happen' is one of the hardest questions.
-
Lesson 14
Consensus: Paxos and Raft, in plain English
The safety/liveness guarantees of consensus protocols, why Raft displaced Paxos in many modern systems, and the systems that depend on them.
-
Lesson 15
Two-phase commit and its problems
The textbook protocol for distributed transactions, the coordinator-failure problem that haunts it, and why modern systems lean on the Saga pattern instead.
-
Lesson 16
Idempotency, exactly-once, at-least-once, at-most-once
What each delivery guarantee actually promises, why 'exactly-once' is mostly a marketing claim, and how idempotent processing makes it irrelevant.
-
Lesson 17
Relational databases: when SQL is the right answer
Postgres as the default. ACID, schemas, joins, the 20-year-stable platform that quietly powers most of the world's transactional systems.
-
Lesson 18
Key-value stores: Redis, DynamoDB, when they win
Pure speed, pure simplicity. The use cases where a key-value store is the right answer: caching, sessions, rate limits, leaderboards.
-
Lesson 19
Document stores: MongoDB and the rise/fall/rebirth
When nested data is the model, what schema-on-read costs, and the operational lessons MongoDB taught the industry.
-
Lesson 20
Wide-column: Cassandra, ScyllaDB, BigTable
The 'infinite scale, query-tied schema' trade. What wide-column databases promise, what they sacrifice, and when the deal is worth it.
-
Lesson 21
Time-series databases: Influx, Timescale, Prometheus
When timestamp-plus-value is 99% of your data. The optimizations that let time-series stores beat general-purpose databases by 10x or more.
-
Lesson 22
Graph databases: Neo4j, when relationships are the data
The queries that are painful in SQL and trivial in Cypher: friend-of-a-friend, shortest path, recommendation systems built on relationship traversal.
-
Lesson 23
Vector databases: Pinecone, Qdrant, the LLM era
Embeddings as the new index, ANN (approximate nearest neighbor) search, and the new infra of the 2024-2026 LLM stack.
-
Lesson 24
Polyglot persistence: when to mix
When your application benefits from multiple databases, when one is enough, and the operational cost of running four data stores instead of one.
-
Lesson 25
Replication patterns: leader/follower, multi-leader, leaderless
The three families of database replication, the trade-offs each makes for consistency and availability, and where each fits in real systems.
-
Lesson 26
Replication lag and read-after-write consistency
The user-saw-stale-data bug. Why it happens with async replication, and the patterns to prevent it: read-your-writes, sticky sessions, monotonic reads.
-
Lesson 27
Partitioning: by key, by hash, by range
When one node can't hold the data, you split it. The three partitioning strategies and the queries each enables.
-
Lesson 28
Hot keys and the rebalancing problem
The celebrity user with one million followers. How to detect a hot key, three strategies to handle it, and why rebalancing a live cluster is harder than it sounds.
-
Lesson 29
Sharding strategies and their gotchas
Application-level sharding, database-native sharding, Citus and Vitess. The practical realities of running a sharded SQL database.
-
Lesson 30
Split brain: what it is and why it ruins everything
The network partition where both halves of a cluster think they're the leader. Why quorum is the only reliable defense.
-
Lesson 31
Cross-shard queries: fan-out vs co-location
When data is split across machines, every query has a cost in proportion to the number of shards it touches. The strategies for keeping that number down.
-
Lesson 32
Real case: Discord's MongoDB to Cassandra to ScyllaDB journey
How Discord's message storage went from MongoDB to Cassandra to ScyllaDB over ten years, what each migration cost, and what the lessons are for everyone else.
-
Lesson 33
ETL vs ELT: where the transform lives
The order of operations changed when warehouses got cheap. Why ELT (extract, load, transform) replaced ETL for most modern data stacks.
-
Lesson 34
Batch processing fundamentals: Hadoop's lessons
What MapReduce got right, what it got wrong, and the shape of batch processing that survived.
-
Lesson 35
Spark and modern batch
The in-memory replacement for Hadoop, the lessons it preserved, and the modern batch stack of 2026.
-
Lesson 36
The medallion architecture: bronze, silver, gold
Three layers of data refinement for a lakehouse. Why every modern data team uses some version of this naming, even when they don't call it 'medallion.'
-
Lesson 37
Lakehouses: Delta, Iceberg, Hudi
ACID transactions on object storage. The format wars of 2023-2025 and where the industry landed in 2026.
-
Lesson 38
Idempotent batch: making jobs safely re-runnable
Overwrite vs append vs upsert. The MERGE pattern. Why 'this job ran twice' should be a non-event.
-
Lesson 39
Backfilling and replay
The moment you discover a six-month-old bug and need to rerun every day since. The patterns that make backfills routine instead of terrifying.
-
Lesson 40
Real case: how Netflix runs daily batch on petabytes
Maestro orchestrator, Iceberg adoption, the cost-optimization layers that make daily batch on petabytes work.
-
Lesson 41
Why streaming: bounded vs unbounded data
The conceptual shift from batch to streaming. Why 'stream' is just 'batch with very small batches' in the limit, and why that limit changes the design.
-
Lesson 42
Kafka: the dominant log
Why Kafka became the integration spine of modern architecture. Topics, partitions, consumer groups, offsets, and the at-least-once guarantee.
-
Lesson 43
Stream processing: Flink, Kafka Streams, Spark Structured Streaming
Three engines for processing streams, when each fits, and why Flink is the heavyweight choice for complex stateful processing.
-
Lesson 44
Event time vs processing time, watermarks
Late-arriving data is the streaming problem nobody warns you about. Event time, watermarks, and the patterns that make windowed aggregations correct.
-
Lesson 45
Exactly-once semantics in streams
What Kafka transactions actually provide, the source-sink coordination problem, the limits, and why exactly-once across services is hard.
-
Lesson 46
CDC (Change Data Capture) and the dual-write problem
Debezium, Maxwell, AWS DMS. The dual-write problem and the outbox pattern that solves it.
-
Lesson 47
Lambda vs kappa architecture
The historical context: why Lambda existed, why Kappa replaced it, and when Lambda still has a place in 2026.
-
Lesson 48
Real case: Uber's real-time pipelines (Marmaray, Hudi origin)
Uber's evolution from batch-only to streaming-first, the data ingestion problem, and the Hudi project that came out of it.
-
Lesson 49
Git for engineering teams: branching strategies that work
Trunk-based, GitHub flow, gitflow. The realities at small vs large team scale, when each fits, and the patterns that survived 15 years of practice.
-
Lesson 50
Trunk-based development: why most modern teams converged here
Short-lived branches, feature flags, continuous integration. The pattern Google, Facebook, and Microsoft adopted at scale, and what it requires to work.
-
Lesson 51
CI for data pipelines: testing without burning a cluster
Unit testing transformations, sample-data integration tests, the local-first development loop. Why CI for data is different from CI for web services.
-
Lesson 52
CD for data: deployment patterns for batch and streaming
Blue-green, canary, dark launch. Why streaming jobs need different deploy patterns than web services, and how batch jobs deploy through their schedule.
-
Lesson 53
Infrastructure as code: Terraform, Pulumi, CDK
Declarative infrastructure, the state-file problem, the GitOps workflow. Three tools and where each fits.
-
Lesson 54
Containers: Docker for data jobs
Dockerfile patterns, multi-stage builds, the right base image, image registries. The container fundamentals every data engineer should know.
-
Lesson 55
Kubernetes for data: the good, the bad, the necessary
When k8s is the right tool, when it's overkill, the operator pattern, and the Spark/Airflow integrations that make data engineering on Kubernetes work.
-
Lesson 56
Real case: Stripe's deployment pipeline
Merge-to-deploy speed, the safety net of automated tests, the deploy-as-non-event culture. What Stripe's published engineering practices reveal about CI/CD at scale.
-
Lesson 57
Orchestration deep dive: Airflow, Prefect, Dagster, Argo Workflows
The four contenders, when each wins, asset-oriented vs task-oriented framing, and the managed vs self-hosted decision.
-
Lesson 58
Asset-oriented orchestration (Dagster's lesson)
Modeling tables and files as first-class objects. Why this approach pays off at scale and what it changes about how teams think about pipelines.
-
Lesson 59
Observability for data: logs, metrics, traces, lineage
The three pillars plus lineage. OpenTelemetry, Datadog, Honeycomb. Lineage tools (Marquez, OpenLineage, DataHub).
-
Lesson 60
SLOs, SLAs, error budgets for data products
Google SRE's framework applied to data: 'the dashboard updated by 9am' as a measurable, defensible commitment.
-
Lesson 61
Data quality: Great Expectations, Soda, dbt tests
Declarative data testing. The three tools, the patterns that work, and the trap of over-testing.
-
Lesson 62
Incident response: runbooks, postmortems, the blameless culture
Google SRE's incident lifecycle, the runbook format that works, the blameless postmortem, and why fixing process beats fixing people.
-
Lesson 63
On-call for data engineering
The realities of being on the rotation. Pager hygiene, escalation, hand-off, and the case for fewer alerts.
-
Lesson 64
Real case: how Airbnb runs their data platform
The Airflow origin story (Airbnb built it), the Minerva metrics layer, the Dataportal data discovery system, and the data quality framework. What Airbnb's published practices reveal about running a data platform at scale.
-
Lesson 65
The cost of cloud: the iceberg model
Compute is the line item everyone watches. Storage, egress, NAT, cross-AZ, requests, and log ingestion are the iceberg below the waterline. Where the bill actually goes, and why FinOps exists.
-
Lesson 66
Storage cost optimization: tiering, lifecycle, compaction
Hot data is a small fraction of total data, but it gets most of the access. Tiering, lifecycle policies, and Parquet compaction are the levers that bring storage cost in line with how the data is actually used.
-
Lesson 67
Compute cost optimization: spot, autoscaling, right-sizing
Three levers move most of the compute bill: Spot instances for interruptible workloads, autoscaling that responds to load without thrashing, and right-sizing the VMs that are mostly oversized. Reserved capacity covers the predictable baseline.
-
Lesson 68
Network cost: egress, cross-AZ, the surprise bill
The most overlooked line on the cloud bill. Egress pricing, cross-AZ traffic, NAT gateways, VPC endpoints, and the architectural patterns that keep network charges from becoming the dominant cost.
-
Lesson 69
Scaling 10x: what breaks, what survives
The 10x exercise. Which components scale linearly with horsepower, which hit walls, and the architectural patterns that survive an order-of-magnitude jump in load.
-
Lesson 70
Caching strategies: CDN, application, database
The three caching tiers, the four canonical cache patterns, the invalidation problem, and how to defend a hot key against the stampede that takes the database down.
-
Lesson 71
The 'rebuild it cheaper' decision
When the vendor invoice gets painful enough that building it in-house starts to look attractive. The honest math, when rebuilding works, when it does not, and the hybrid that often wins.
-
Lesson 72
Real case: how Pinterest cut their data infra cost in half
A multi-year cost-reduction programme on a multi-petabyte AWS data platform. Storage tiering, Spark efficiency, query rewrites, right-sizing, and the cultural changes that made the savings stick.
-
Lesson 73
Microservices: when, when not, the monolith comeback
The 2015-2020 microservices boom, the 2021+ pushback, and the modular monolith as the middle path. Conway's law, the distributed-systems tax, and how to pick by team size and scaling profile.
-
Lesson 74
Event-driven architecture: saga, choreography, orchestration
Services that talk by emitting events, the choreography vs orchestration choice, the saga pattern, and the 2026 toolset (Temporal, Step Functions, Camunda, Argo).
-
Lesson 75
Multi-region deployments: active-active, active-passive, follow-the-sun
Why teams go multi-region (latency, DR, compliance, capacity), the three deployment shapes, the hard problems (replication, conflicts, cost), and when not to bother.
-
Lesson 76
Disaster recovery: RTO, RPO, the drill
What disaster recovery actually means in practice. The four DR tiers, RTO and RPO as design dials, and the discipline of the drill that proves the plan works.
-
Lesson 77
Security architecture: least privilege, defense in depth
The security principles every system needs as load-bearing architecture. Least privilege, defense in depth, zero trust, and the IAM and network controls that turn principles into reality.
-
Lesson 78
Privacy and compliance: GDPR, CCPA, data residency
Privacy regulations as architectural drivers. Right to erasure, data residency, customer-managed keys, and the consent and audit infrastructure compliance frameworks require.
-
Lesson 79
ML platform architecture: feature store, model registry, serving
The five layers a modern ML platform standardised on, the train-serve skew problem the feature store was invented to solve, and the build-versus-buy calculus for each layer in 2026.
-
Lesson 80
Capstone: design a complete architecture for a fictional company at three scales
Eighty lessons of system architecture, condensed into one design exercise. The same fictional SaaS company, three scales, three architectures, and a guided tour of what changes and why. The closing lesson of the course.