Architecture · Programming

Interactive

Reference architectures

Eight click-through blueprints with animated data flow: starter SaaS, the 80% standard, AI-augmented SaaS, Excel-to-BI, medallion lakehouse, event-driven, multi-region, data-team toolchain.

Essay

The 80% architecture

The default modern SaaS stack that fits most companies. What it is, what it costs, why it's standard, and when to deviate.

Course

Data & System Architecture, from the ground up

An 80-lesson course that starts at 'what is a single-server app' and ends at 'design a global multi-region system at three scales.' Heavy on real engineering case studies (Netflix, Uber, Stripe, Discord, Pinterest, Airbnb), heavy on diagrams, and built on the foundations of Designing Data-Intensive Applications, the SRE workbook, and a decade of post-mortems.

Published: 80 of 80

Lesson 1

What software architecture actually is

Published on 27 October 2025 · 11 min read

Definitions that aren't useless. Architecture as the set of decisions that are expensive to change later. The 'if it's hard to change, it's architecture' rule.
- #architecture
- #fundamentals
- #intro
Lesson 2

Functional vs non-functional requirements

Published on 29 October 2025 · 13 min read

The trap of focusing on features. The qualities that drive architecture: latency, throughput, availability, durability, consistency, security, evolvability.
- #architecture
- #requirements
- #fundamentals
Lesson 3

The C4 model: context, container, component, code

Published on 31 October 2025 · 12 min read

A four-zoom-level diagramming convention that fits actual conversations about systems.
- #architecture
- #c4
- #diagrams
- #fundamentals
Lesson 4

Architectural Decision Records (ADRs)

Published on 3 November 2025 · 11 min read

Capturing decisions with their alternatives and consequences. The Michael Nygard format that most teams have converged on.
- #architecture
- #adr
- #decisions
- #documentation
Lesson 5

Trade-offs are everything

Published on 5 November 2025 · 12 min read

Latency vs throughput, consistency vs availability, simple vs flexible. The named trade-off catalogue and why 'we want it all' is the most expensive request in the room.
- #architecture
- #tradeoffs
- #fundamentals
Lesson 6

The first architecture: a single-server web app + database

Published on 7 November 2025 · 15 min read

What a working system looks like at startup scale: one VM, one Postgres, one process, and why that's an entirely respectable place to start.
- #architecture
- #monolith
- #postgres
- #startup
Lesson 7

When the first architecture stops being enough

Published on 10 November 2025 · 10 min read

The symptoms that say the single-server app has reached its ceiling. Database contention, deploy-causes-outage, the daily backup that takes longer than a day.
- #architecture
- #scaling
- #monolith
- #performance
Lesson 8

Three case studies of 'we should have started simpler'

Published on 12 November 2025 · 10 min read

Stripe staying on Postgres far longer than people expected, Shopify's monolith stance, and Basecamp's majestic monolith. The case for not over-engineering early.
- #architecture
- #monolith
- #case-study
- #simplicity
Lesson 9

Why distributed systems are hard: the 8 fallacies

Published on 14 November 2025 · 11 min read

Peter Deutsch's list, restated for 2026: 'the network is reliable', 'latency is zero', 'bandwidth is infinite', and the rest, plus how each fallacy ruins someone's week.
- #architecture
- #distributed-systems
- #fallacies
- #fundamentals
Lesson 10

CAP theorem, in practice

Published on 17 November 2025 · 10 min read

What CAP actually says, what it doesn't say, and why 'AP system' is half a sentence. Real examples: DNS, bank ledgers, Cassandra.
- #architecture
- #cap-theorem
- #distributed-systems
- #consistency
Lesson 11

PACELC: what CAP missed

Published on 19 November 2025 · 10 min read

Daniel Abadi's extension. Even in the absence of partitions, you trade latency for consistency.
- #architecture
- #pacelc
- #consistency
- #latency
Lesson 12

Consistency models: strong, eventual, causal, monotonic

Published on 21 November 2025 · 9 min read

The spectrum of guarantees a system can offer, with a worked example showing what each model promises and breaks.
- #architecture
- #consistency
- #distributed-systems
Lesson 13

Time in distributed systems: clocks, ordering, vector clocks

Published on 24 November 2025 · 10 min read

Physical time is a lie. Lamport timestamps, vector clocks, Google's TrueTime, and why 'when did this happen' is one of the hardest questions.
- #architecture
- #distributed-systems
- #time
- #clocks
Lesson 14

Consensus: Paxos and Raft, in plain English

Published on 26 November 2025 · 10 min read

The safety/liveness guarantees of consensus protocols, why Raft displaced Paxos as the default choice in many modern systems, and the systems that depend on them.
- #architecture
- #consensus
- #paxos
- #raft
- #distributed-systems
Lesson 15

Two-phase commit and its problems

Published on 28 November 2025 · 12 min read

The textbook protocol for distributed transactions, the coordinator-failure problem that haunts it, and why modern systems lean on the Saga pattern instead.
- #architecture
- #two-phase-commit
- #transactions
- #saga
Lesson 16

Idempotency, exactly-once, at-least-once, at-most-once

Published on 1 December 2025 · 12 min read

What each delivery guarantee actually promises, why 'exactly-once' is mostly a marketing claim, and how idempotent processing makes it irrelevant.
- #architecture
- #idempotency
- #delivery-semantics
- #messaging
Lesson 17

Relational databases: when SQL is the right answer

Published on 3 December 2025 · 10 min read

Postgres as the default. ACID, schemas, joins, the 20-year-stable platform that quietly powers most of the world's transactional systems.
- #architecture
- #postgres
- #mysql
- #sql
- #databases
Lesson 18

Key-value stores: Redis, DynamoDB, when they win

Published on 5 December 2025 · 10 min read

Pure speed, pure simplicity. The use cases where a key-value store is the right answer: caching, sessions, rate limits, leaderboards.
- #architecture
- #redis
- #dynamodb
- #key-value
- #cache
Lesson 19

Document stores: MongoDB and the rise/fall/rebirth

Published on 8 December 2025 · 10 min read

When nested data is the model, what schema-on-read costs, and the operational lessons MongoDB taught the industry.
- #architecture
- #mongodb
- #document-store
- #json
- #nosql
Lesson 20

Wide-column: Cassandra, ScyllaDB, BigTable

Published on 10 December 2025 · 9 min read

The 'infinite scale, query-tied schema' trade. What wide-column databases promise, what they sacrifice, and when the deal is worth it.
- #architecture
- #cassandra
- #scylladb
- #bigtable
- #wide-column
- #nosql
Lesson 21

Time-series databases: Influx, Timescale, Prometheus

Published on 12 December 2025 · 10 min read

When timestamp-plus-value is 99% of your data. The optimizations that let time-series stores beat general-purpose databases by 10x or more.
- #architecture
- #time-series
- #influxdb
- #timescaledb
- #prometheus
Lesson 22

Graph databases: Neo4j, when relationships are the data

Published on 15 December 2025 · 10 min read

The queries that are painful in SQL and trivial in Cypher: friend-of-a-friend, shortest path, recommendation systems built on relationship traversal.
- #architecture
- #graph
- #neo4j
- #cypher
- #traversal
Lesson 23

Vector databases: Pinecone, Qdrant, the LLM era

Published on 17 December 2025 · 10 min read

Embeddings as the new index, ANN (approximate nearest neighbor) search, and the new infra of the 2024-2026 LLM stack.
- #architecture
- #vector-databases
- #embeddings
- #llm
- #ann-search
Lesson 24

Polyglot persistence: when to mix

Published on 19 December 2025 · 11 min read

When your application benefits from multiple databases, when one is enough, and the operational cost of running four data stores instead of one.
- #architecture
- #polyglot-persistence
- #databases
- #design
Lesson 25

Replication patterns: leader/follower, multi-leader, leaderless

Published on 22 December 2025 · 9 min read

The three families of database replication, the trade-offs each makes for consistency and availability, and where each fits in real systems.
- #architecture
- #replication
- #leader-follower
- #multi-leader
- #leaderless
Lesson 26

Replication lag and read-after-write consistency

Published on 24 December 2025 · 10 min read

The user-saw-stale-data bug. Why it happens with async replication, and the patterns to prevent it: read-your-writes, sticky sessions, monotonic reads.
- #architecture
- #replication
- #consistency
- #read-after-write
Lesson 27

Partitioning: by key, by hash, by range

Published on 26 December 2025 · 11 min read

When one node can't hold the data, you split it. The three partitioning strategies and the queries each enables.
- #architecture
- #partitioning
- #sharding
- #hash
- #range
Lesson 28

Hot keys and the rebalancing problem

Published on 29 December 2025 · 10 min read

The celebrity user with one million followers. How to detect a hot key, three strategies to handle it, and why rebalancing a live cluster is harder than it sounds.
- #architecture
- #hot-keys
- #rebalancing
- #partitioning
Lesson 29

Sharding strategies and their gotchas

Published on 31 December 2025 · 10 min read

Application-level sharding, database-native sharding, Citus and Vitess. The practical realities of running a sharded SQL database.
- #architecture
- #sharding
- #citus
- #vitess
- #postgres
- #mysql
Lesson 30

Split brain: what it is and why it ruins everything

Published on 2 January 2026 · 11 min read

The network partition where both halves of a cluster think they're the leader. Why quorum is the only reliable defense.
- #architecture
- #split-brain
- #network-partition
- #quorum
Lesson 31

Cross-shard queries: fan-out vs co-location

Published on 5 January 2026 · 10 min read

When data is split across machines, every query has a cost in proportion to the number of shards it touches. The strategies for keeping that number down.
- #architecture
- #sharding
- #queries
- #fan-out
Lesson 32

Real case: Discord's MongoDB to Cassandra to ScyllaDB journey

Published on 7 January 2026 · 11 min read

How Discord's message storage went from MongoDB to Cassandra to ScyllaDB over ten years, what each migration cost, and what the lessons are for everyone else.
- #architecture
- #discord
- #case-study
- #mongodb
- #cassandra
- #scylladb
Lesson 33

ETL vs ELT: where the transform lives

Published on 9 January 2026 · 9 min read

The order of operations changed when warehouses got cheap. Why ELT (extract, load, transform) replaced ETL for most modern data stacks.
- #architecture
- #etl
- #elt
- #data-engineering
- #warehouse
Lesson 34

Batch processing fundamentals: Hadoop's lessons

Published on 12 January 2026 · 10 min read

What MapReduce got right, what it got wrong, and the shape of batch processing that survived.
- #architecture
- #hadoop
- #mapreduce
- #batch
- #big-data
Lesson 35

Spark and modern batch

Published on 14 January 2026 · 10 min read

The in-memory successor to Hadoop MapReduce, the lessons it preserved, and the modern batch stack of 2026.
- #architecture
- #spark
- #databricks
- #batch
- #in-memory
Lesson 36

The medallion architecture: bronze, silver, gold

Published on 16 January 2026 · 9 min read

Three layers of data refinement for a lakehouse. Why every modern data team uses some version of this naming, even when they don't call it 'medallion.'
- #architecture
- #medallion
- #bronze-silver-gold
- #data-lake
- #databricks
Lesson 37

Lakehouses: Delta, Iceberg, Hudi

Published on 19 January 2026 · 9 min read

ACID transactions on object storage. The format wars of 2023-2025 and where the industry landed in 2026.
- #architecture
- #delta-lake
- #iceberg
- #hudi
- #lakehouse
- #acid
Lesson 38

Idempotent batch: making jobs safely re-runnable

Published on 21 January 2026 · 9 min read

Overwrite vs append vs upsert. The MERGE pattern. Why 'this job ran twice' should be a non-event.
- #architecture
- #idempotency
- #batch
- #merge
- #upsert
Lesson 39

Backfilling and replay

Published on 23 January 2026 · 10 min read

The moment you discover a six-month-old bug and need to rerun every day since. The patterns that make backfills routine instead of terrifying.
- #architecture
- #backfill
- #replay
- #batch
- #lambda-architecture
Lesson 40

Real case: how Netflix runs daily batch on petabytes

Published on 26 January 2026 · 11 min read

Maestro orchestrator, Iceberg adoption, the cost-optimization layers that make daily batch on petabytes work.
- #architecture
- #netflix
- #case-study
- #batch
- #iceberg
- #maestro
Lesson 41

Why streaming: bounded vs unbounded data

Published on 28 January 2026 · 10 min read

The conceptual shift from batch to streaming. Why 'stream' is just 'batch with very small batches' in the limit, and why that limit changes the design.
- #architecture
- #streaming
- #real-time
- #batch
Lesson 42

Kafka: the dominant log

Published on 30 January 2026 · 10 min read

Why Kafka became the integration spine of modern architecture. Topics, partitions, consumer groups, offsets, and the at-least-once guarantee.
- #architecture
- #kafka
- #streaming
- #log
- #integration
Lesson 43

Stream processing: Flink, Kafka Streams, Spark Structured Streaming

Published on 2 February 2026 · 10 min read

Three engines for processing streams, when each fits, and why Flink is the heavyweight choice for complex stateful processing.
- #architecture
- #flink
- #kafka-streams
- #spark
- #structured-streaming
Lesson 44

Event time vs processing time, watermarks

Published on 4 February 2026 · 9 min read

Late-arriving data is the streaming problem nobody warns you about. Event time, watermarks, and the patterns that make windowed aggregations correct.
- #architecture
- #streaming
- #event-time
- #watermarks
- #late-data
Lesson 45

Exactly-once semantics in streams

Published on 6 February 2026 · 9 min read

What Kafka transactions actually provide, the source-sink coordination problem, the limits, and why exactly-once across services is hard.
- #architecture
- #exactly-once
- #kafka-transactions
- #streaming
- #idempotency
Lesson 46

CDC (Change Data Capture) and the dual-write problem

Published on 9 February 2026 · 10 min read

Debezium, Maxwell, AWS DMS. The dual-write problem and the outbox pattern that solves it.
- #architecture
- #cdc
- #debezium
- #outbox-pattern
- #streaming
Lesson 47

Lambda vs kappa architecture

Published on 11 February 2026 · 11 min read

The historical context: why Lambda existed, why Kappa replaced it, and when Lambda still has a place in 2026.
- #architecture
- #lambda-architecture
- #kappa-architecture
- #streaming
- #batch
Lesson 48

Real case: Uber's real-time pipelines (Marmaray, Hudi origin)

Published on 13 February 2026 · 12 min read

Uber's evolution from batch-only to streaming-first, the data ingestion problem, and the Hudi project that came out of it.
- #architecture
- #uber
- #case-study
- #marmaray
- #hudi
- #streaming
Lesson 49

Git for engineering teams: branching strategies that work

Published on 16 February 2026 · 10 min read

Trunk-based, GitHub flow, gitflow. The realities at small vs large team scale, when each fits, and the patterns that survived 15 years of practice.
- #architecture
- #git
- #branching
- #version-control
Lesson 50

Trunk-based development: why most modern teams converged here

Published on 18 February 2026 · 12 min read

Short-lived branches, feature flags, continuous integration. The pattern Google, Facebook, and Microsoft adopted at scale, and what it requires to work.
- #architecture
- #trunk-based-development
- #feature-flags
- #ci
Lesson 51

CI for data pipelines: testing without burning a cluster

Published on 20 February 2026 · 9 min read

Unit testing transformations, sample-data integration tests, the local-first development loop. Why CI for data is different from CI for web services.
- #architecture
- #ci
- #testing
- #data-pipelines
- #dbt
Lesson 52

CD for data: deployment patterns for batch and streaming

Published on 23 February 2026 · 9 min read

Blue-green, canary, dark launch. Why streaming jobs need different deploy patterns than web services, and how batch jobs deploy through their schedule.
- #architecture
- #continuous-deployment
- #deployment-patterns
- #data-pipelines
Lesson 53

Infrastructure as code: Terraform, Pulumi, CDK

Published on 25 February 2026 · 9 min read

Declarative infrastructure, the state-file problem, the GitOps workflow. Three tools and where each fits.
- #architecture
- #terraform
- #pulumi
- #cdk
- #iac
- #gitops
Lesson 54

Containers: Docker for data jobs

Published on 27 February 2026 · 9 min read

Dockerfile patterns, multi-stage builds, the right base image, image registries. The container fundamentals every data engineer should know.
- #architecture
- #docker
- #containers
- #dockerfile
Lesson 55

Kubernetes for data: the good, the bad, the necessary

Published on 2 March 2026 · 10 min read

When k8s is the right tool, when it's overkill, the operator pattern, and the Spark/Airflow integrations that make data engineering on Kubernetes work.
- #architecture
- #kubernetes
- #spark-on-k8s
- #airflow-on-k8s
- #operators
Lesson 56

Real case: Stripe's deployment pipeline

Published on 4 March 2026 · 11 min read

Merge-to-deploy speed, the safety net of automated tests, the deploy-as-non-event culture. What Stripe's published engineering practices reveal about CI/CD at scale.
- #architecture
- #stripe
- #case-study
- #deployment
- #ci-cd
- #monorepo
Lesson 57

Orchestration deep dive: Airflow, Prefect, Dagster, Argo Workflows

Published on 6 March 2026 · 11 min read

The four contenders, when each wins, asset-oriented vs task-oriented framing, and the managed vs self-hosted decision.
- #architecture
- #airflow
- #prefect
- #dagster
- #argo
- #orchestration
Lesson 58

Asset-oriented orchestration (Dagster's lesson)

Published on 9 March 2026 · 10 min read

Modeling tables and files as first-class objects. Why this approach pays off at scale and what it changes about how teams think about pipelines.
- #architecture
- #dagster
- #asset-oriented
- #lineage
- #data-products
Lesson 59

Observability for data: logs, metrics, traces, lineage

Published on 11 March 2026 · 11 min read

The three pillars plus lineage. OpenTelemetry, Datadog, Honeycomb. Lineage tools (Marquez, OpenLineage, DataHub).
- #architecture
- #observability
- #logs
- #metrics
- #traces
- #lineage
Lesson 60

SLOs, SLAs, error budgets for data products

Published on 13 March 2026 · 9 min read

Google SRE's framework applied to data: 'the dashboard updated by 9am' as a measurable, defensible commitment.
- #architecture
- #slo
- #sla
- #sre
- #error-budget
- #reliability
Lesson 61

Data quality: Great Expectations, Soda, dbt tests

Published on 16 March 2026 · 10 min read

Declarative data testing. The three tools, the patterns that work, and the trap of over-testing.
- #architecture
- #data-quality
- #great-expectations
- #soda
- #dbt-tests
Lesson 62

Incident response: runbooks, postmortems, the blameless culture

Published on 18 March 2026 · 10 min read

Google SRE's incident lifecycle, the runbook format that works, the blameless postmortem, and why fixing process beats fixing people.
- #architecture
- #incidents
- #runbooks
- #postmortems
- #sre
Lesson 63

On-call for data engineering

Published on 20 March 2026 · 10 min read

The realities of being on the rotation. Pager hygiene, escalation, hand-off, and the case for fewer alerts.
- #architecture
- #on-call
- #sre
- #alerting
- #pager
Lesson 64

Real case: how Airbnb runs their data platform

Published on 23 March 2026 · 11 min read

The Airflow origin story (Airbnb built it), the Minerva metrics layer, the Dataportal data discovery system, and the data quality framework. What Airbnb's published practices reveal about running a data platform at scale.
- #architecture
- #airbnb
- #case-study
- #airflow
- #minerva
- #data-quality
Lesson 65

The cost of cloud: the iceberg model

Published on 25 March 2026 · 11 min read

Compute is the line item everyone watches. Storage, egress, NAT, cross-AZ, requests, and log ingestion are the iceberg below the waterline. Where the bill actually goes, and why FinOps exists.
- #architecture
- #cost
- #cloud
- #finops
Lesson 66

Storage cost optimization: tiering, lifecycle, compaction

Published on 27 March 2026 · 10 min read

Hot data is a small fraction of total data, but it gets most of the access. Tiering, lifecycle policies, and Parquet compaction are the levers that bring storage cost in line with how the data is actually used.
- #architecture
- #cost
- #storage
- #s3
- #parquet
Lesson 67

Compute cost optimization: spot, autoscaling, right-sizing

Published on 30 March 2026 · 11 min read

Three levers move most of the compute bill: Spot instances for interruptible workloads, autoscaling that responds to load without thrashing, and right-sizing VMs, most of which are oversized. Reserved capacity covers the predictable baseline.
- #architecture
- #cost
- #compute
- #spot
- #autoscaling
Lesson 68

Network cost: egress, cross-AZ, the surprise bill

Published on 1 April 2026 · 10 min read

The most overlooked line on the cloud bill. Egress pricing, cross-AZ traffic, NAT gateways, VPC endpoints, and the architectural patterns that keep network charges from becoming the dominant cost.
- #architecture
- #cost
- #network
- #egress
- #aws
Lesson 69

Scaling 10x: what breaks, what survives

Published on 3 April 2026 · 12 min read

The 10x exercise. Which components scale linearly with horsepower, which hit walls, and the architectural patterns that survive an order-of-magnitude jump in load.
- #architecture
- #scaling
- #capacity
- #performance
Lesson 70

Caching strategies: CDN, application, database

Published on 6 April 2026 · 13 min read

The three caching tiers, the four canonical cache patterns, the invalidation problem, and how to defend a hot key against the stampede that takes the database down.
- #architecture
- #caching
- #redis
- #cdn
- #cache-aside
Lesson 71

The 'rebuild it cheaper' decision

Published on 8 April 2026 · 11 min read

When the vendor invoice gets painful enough that building it in-house starts to look attractive. The honest math, when rebuilding works, when it does not, and the hybrid that often wins.
- #architecture
- #cost
- #build-vs-buy
- #vendor
Lesson 72

Real case: how Pinterest cut their data infra cost in half

Published on 10 April 2026 · 14 min read

A multi-year cost-reduction programme on a multi-petabyte AWS data platform. Storage tiering, Spark efficiency, query rewrites, right-sizing, and the cultural changes that made the savings stick.
- #architecture
- #pinterest
- #case-study
- #cost-optimization
Lesson 73

Microservices: when, when not, the monolith comeback

Published on 13 April 2026 · 9 min read

The 2015-2020 microservices boom, the 2021+ pushback, and the modular monolith as the middle path. Conway's law, the distributed-systems tax, and how to pick by team size and scaling profile.
- #architecture
- #microservices
- #monolith
Lesson 74

Event-driven architecture: saga, choreography, orchestration

Published on 15 April 2026 · 10 min read

Services that talk by emitting events, the choreography vs orchestration choice, the saga pattern, and the 2026 toolset (Temporal, Step Functions, Camunda, Argo).
- #architecture
- #event-driven
- #saga
- #choreography
- #orchestration
Lesson 75

Multi-region deployments: active-active, active-passive, follow-the-sun

Published on 17 April 2026 · 10 min read

Why teams go multi-region (latency, DR, compliance, capacity), the three deployment shapes, the hard problems (replication, conflicts, cost), and when not to bother.
- #architecture
- #multi-region
- #geography
- #latency
Lesson 76

Disaster recovery: RTO, RPO, the drill

Published on 20 April 2026 · 10 min read

What disaster recovery actually means in practice. The four DR tiers, RTO and RPO as design dials, and the discipline of the drill that proves the plan works.
- #architecture
- #disaster-recovery
- #rto
- #rpo
- #backups
Lesson 77

Security architecture: least privilege, defense in depth

Published on 22 April 2026 · 11 min read

The security principles every system needs as load-bearing architecture. Least privilege, defense in depth, zero trust, and the IAM and network controls that turn principles into reality.
- #architecture
- #security
- #iam
- #least-privilege
Lesson 78

Privacy and compliance: GDPR, CCPA, data residency

Published on 24 April 2026 · 11 min read

Privacy regulations as architectural drivers. Right to erasure, data residency, customer-managed keys, and the consent and audit infrastructure compliance frameworks require.
- #architecture
- #gdpr
- #ccpa
- #privacy
- #compliance
- #residency
Lesson 79

ML platform architecture: feature store, model registry, serving

Published on 27 April 2026 · 10 min read

The five layers modern ML platforms have standardised on, the train-serve skew problem the feature store was invented to solve, and the build-versus-buy calculus for each layer in 2026.
- #architecture
- #ml-platform
- #feature-store
- #mlflow
- #serving
Lesson 80

Capstone: design a complete architecture for a fictional company at three scales

Published on 1 May 2026 · 12 min read

Eighty lessons of system architecture, condensed into one design exercise. The same fictional SaaS company, three scales, three architectures, and a guided tour of what changes and why. The closing lesson of the course.
- #architecture
- #capstone
- #course-summary