Data & System Architecture, from the ground up Lesson 65 / 80

The cost of cloud: the iceberg model

Compute is the line item everyone watches. Storage, egress, NAT, cross-AZ, requests, and log ingestion are the iceberg below the waterline. Where the bill actually goes, and why FinOps exists.

Module 8 closed with the operational practices that turn a data platform into something a team can actually run. Module 9 opens with the next layer of seriousness: the platform has to be affordable. A reliable platform that costs three times what the business expected gets cut. A cheap platform that drops data costs the business more than it ever saves. Cost is not the opposite of reliability or performance; it is the third axis a mature team has to balance.

This lesson sets up the module by describing the shape of a cloud bill. The headline finding, which every FinOps practitioner has internalised and most engineers have not, is that compute is only the visible part. The real bill is an iceberg, and the part below the waterline is where surprise charges live. Storage, network egress, NAT gateways, cross-AZ traffic, request counts, log ingestion, and a long tail of managed-service line items make up the bulk of the bill at most mature companies.

The framing matters because the levers an engineer reaches for instinctively are compute levers. Right-size the EC2 instance. Use Spot. Scale down at night. These work, and lesson 67 covers them in detail. But a team that focuses only on compute will solve a third of the problem and discover the other two-thirds when the egress bill triples after a partner integration ships.

What the bill actually looks like

Public surveys, vendor reports, and the FinOps Foundation’s State of FinOps reports converge on a rough composition for a typical mid-size company on AWS. The exact percentages vary by workload, but the shape is recognisable across most data-heavy organisations.

  • Compute (EC2, EKS nodes, Fargate, Lambda, EMR): around 30 percent.
  • Storage (S3, EBS, EFS, snapshots, backups): around 20 percent.
  • Network (egress to internet, NAT Gateway data processing, cross-AZ traffic, inter-region transfer): around 15 percent.
  • Managed services (RDS, DynamoDB, Aurora, Redshift, Snowflake on AWS, Kinesis, MSK, OpenSearch): around 15 percent.
  • Data transfer adjacent items (CloudFront, Direct Connect, Transit Gateway, VPC endpoints): around 10 percent.
  • Everything else (security tooling, monitoring, third-party SaaS through Marketplace, GPU instances for ML, KMS, Secrets Manager, miscellaneous): the remaining 10 percent.

The numbers from the FinOps Foundation’s annual reports (https://www.finops.org/, retrieved 2026-05-01) are in the same neighbourhood, with notable variation by industry. Streaming-heavy companies see network climb past compute. Analytics-heavy companies see storage and managed warehouse spend dominate. ML-heavy companies see GPU compute and accelerator-attached storage take a larger slice. Cloudflare’s blog post “AWS’s Egregious Egress” (https://blog.cloudflare.com/aws-egregious-egress/, retrieved 2026-05-01) makes the case that AWS egress in particular is priced an order of magnitude above the underlying cost, which is why Cloudflare R2 and similar zero-egress object stores have a meaningful market.

The point is not the precise percentages, which shift over time and across teams. The point is that compute is roughly a third of the bill, and the team that only watches compute is watching the wrong scoreboard.

The iceberg

flowchart TB
    subgraph Visible["Above the waterline (what dashboards show)"]
        Compute["EC2 / EKS nodes / Lambda<br/>~30 percent"]
    end
    subgraph Hidden["Below the waterline (what the bill actually contains)"]
        Storage["S3 / EBS / snapshots / backups<br/>~20 percent"]
        Network["Egress / NAT / cross-AZ / inter-region<br/>~15 percent"]
        Managed["RDS / DynamoDB / Redshift / Kinesis<br/>~15 percent"]
        Transfer["CloudFront / Transit Gateway / VPC endpoints<br/>~10 percent"]
        Misc["Logs / KMS / Secrets / third-party SaaS<br/>~10 percent"]
    end
    Visible --> Hidden

A few of the line items below the waterline are worth calling out individually because they catch teams by surprise more than the others.

NAT Gateway data processing. A NAT Gateway in AWS charges around 4.5 cents per gigabyte processed, on top of the hourly fee per gateway. A workload that pulls a hundred gigabytes a day from S3 through a NAT Gateway, instead of through a VPC endpoint, costs about 135 dollars a month for the data-processing fee alone, and that is one workload. Multiplied across a fleet, a missing VPC endpoint configuration can add thousands of dollars a month for traffic that should be free. AWS pricing pages (https://aws.amazon.com/vpc/pricing/, retrieved 2026-05-01) have the canonical figures.
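The arithmetic behind the 135-dollar figure can be sketched in a few lines of Python. This is a rough estimate only: the function name is made up for illustration, the rate is the approximate 4.5 cents per gigabyte quoted above, the hourly gateway fee is left out, and the fleet of 40 workloads is a hypothetical number chosen to show how the fee scales.

```python
def nat_processing_cost(gb_per_day: float, rate_per_gb: float = 0.045, days: int = 30) -> float:
    """Approximate monthly NAT Gateway data-processing fee in dollars.

    Excludes the hourly per-gateway fee. The default rate is the rough
    figure from the text, not a quoted price.
    """
    return round(gb_per_day * days * rate_per_gb, 2)


one_workload = nat_processing_cost(100)      # 100 GB/day through the NAT
fleet = nat_processing_cost(100 * 40)        # hypothetical: 40 similar workloads

print(one_workload)  # 135.0
print(fleet)         # 5400.0
```

The same traffic routed through an S3 gateway endpoint would not incur this data-processing fee, which is why the missing endpoint is the expensive part.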

Cross-AZ traffic. AWS charges roughly 1 cent per gigabyte in each direction for traffic between availability zones in the same region, for a round trip of 2 cents per gigabyte. A chatty microservice mesh across three AZs can generate terabytes a day of cross-AZ chatter. A team running Kafka with brokers in three AZs and consumers in three AZs pays for every replicated and consumed message that crosses an AZ boundary. Topology-aware producers and consumers (rack-awareness in Kafka, zone-aware routing in service meshes) exist precisely to keep this cost in check.
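A back-of-the-envelope model makes the Kafka case concrete. The sketch below is an illustration under stated assumptions, not a billing formula: clients and partition leaders are spread uniformly across AZs, there is one replica per AZ, consumers always fetch from the leader, and each gigabyte that crosses a boundary is charged at the 2 cents per gigabyte figure above. All function and parameter names are hypothetical.

```python
def kafka_cross_az_gb(produced_gb: float, azs: int = 3,
                      replication_factor: int = 3, consumer_groups: int = 1) -> float:
    """Estimate gigabytes crossing an AZ boundary per `produced_gb` of messages.

    Assumes uniform placement (a client is in a different AZ from the
    leader with probability (azs - 1) / azs) and one replica per AZ, so
    every follower copy crosses a boundary.
    """
    away = (azs - 1) / azs
    produce = produced_gb * away                        # producer -> leader
    replicate = produced_gb * (replication_factor - 1)  # leader -> followers
    consume = produced_gb * consumer_groups * away      # leader -> consumers
    return produce + replicate + consume


def cross_az_cost(gb: float, rate_per_gb: float = 0.02) -> float:
    """Dollars for `gb` of cross-AZ traffic at an assumed effective rate."""
    return round(gb * rate_per_gb, 2)


daily_gb = kafka_cross_az_gb(1000)   # 1 TB produced per day, one consumer group
print(round(daily_gb))               # 3333
print(cross_az_cost(daily_gb))       # 66.67
```

Under these assumptions a terabyte produced per day generates more than three terabytes of cross-AZ traffic, which is the multiplier that rack-awareness and zone-aware consumption are meant to shrink.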

Egress to the internet. Outbound traffic from AWS to the public internet runs around 9 cents per gigabyte for the first tier, dropping with volume but never going to zero. A workload that backfills a terabyte to a partner is a 90-dollar one-off that nobody flags until it shows up in next month’s bill. A workload that streams a continuous gigabit per second to a partner moves about 324 terabytes a month, which is close to 30,000 dollars at the first-tier rate and still a five-figure sum after volume discounts, and the team that set up the integration is often unaware until Finance asks.
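Converting a sustained bandwidth into a monthly volume is the step that usually gets skipped. The sketch below does that conversion and prices the result at a flat 9 cents per gigabyte; it deliberately ignores volume tiers, so it is an upper-bound estimate, and the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def sustained_egress_gb(gbps: float, days: int = 30) -> float:
    """Gigabytes sent over `days` at a constant `gbps` gigabits per second."""
    gigabytes_per_second = gbps / 8          # 8 bits per byte
    return gigabytes_per_second * SECONDS_PER_DAY * days


def flat_egress_cost(gb: float, rate_per_gb: float = 0.09) -> float:
    """Dollars at a single flat rate (no volume tiers, an assumption)."""
    return round(gb * rate_per_gb, 2)


monthly_gb = sustained_egress_gb(1)          # one continuous gigabit per second
print(monthly_gb)                            # 324000.0
print(flat_egress_cost(monthly_gb))          # 29160.0
print(flat_egress_cost(1000))                # 90.0, the one-terabyte backfill
```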

Request counts. S3 charges per request: roughly half a cent per thousand PUT requests and a twentieth of a cent per thousand GET requests. A pipeline that writes ten million tiny files a day is paying for ten million PUTs, and the same pipeline reading those files back is paying again. The fix (file compaction) is the topic of lesson 66, but the cost trap is worth flagging here.
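The request charge is easy to estimate once the file count is known. This is a minimal sketch using the rough per-thousand rates quoted above as defaults; the function name and the compacted file count of 10,000 are hypothetical values chosen for illustration.

```python
def s3_request_cost(puts: int, gets: int,
                    put_per_1000: float = 0.005,
                    get_per_1000: float = 0.0005) -> float:
    """Approximate dollars for a given number of PUT and GET requests.

    Default rates are the rough figures from the text: half a cent per
    thousand PUTs and a twentieth of a cent per thousand GETs.
    """
    cost = puts / 1000 * put_per_1000 + gets / 1000 * get_per_1000
    return round(cost, 4)


tiny_files = s3_request_cost(puts=10_000_000, gets=10_000_000)
compacted = s3_request_cost(puts=10_000, gets=10_000)   # hypothetical compacted layout

print(tiny_files)   # 55.0 dollars per day
print(compacted)    # well under a dollar per day
```

At ten million tiny files a day, written once and read once, the request charge alone is on the order of 1,650 dollars a month before any storage is counted.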

Log ingestion. CloudWatch Logs charges around 50 cents per gigabyte ingested. A misconfigured application that logs every request body at DEBUG level can ingest hundreds of gigabytes a day, for thousands of dollars a month, with the engineer who set the log level long since moved to another team. Datadog, Splunk, and other third-party log platforms have the same shape with their own pricing tiers.

The classic surprise-bill stories

Every team that has been on AWS for more than a year has at least one of these stories. The patterns repeat enough to be worth cataloguing.

S3 access logs that ingested themselves. A team turned on S3 access logs for an analytics bucket and configured the destination as the same bucket. Each access generated a log entry, the log entry was an access, the access generated a log entry. The bucket grew exponentially over a weekend, and Monday’s storage bill was a couple of orders of magnitude above expectation. The fix is trivial (separate destination bucket, lifecycle policy on the log bucket). The lesson is that recursive configurations are easy to set up and produce non-linear bills.

The forgotten EBS volume. A team terminated an EC2 instance during a refactor, but because the “delete on termination” flag was set to false, the EBS volume survived as detached storage. The volume kept charging at the GB-month rate for two years before someone noticed during an audit. Multiplied across the fleet, untagged orphan volumes are a perennial line item in cost-optimisation engagements.
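An orphan-volume audit boils down to filtering for volumes that are not attached to anything. The sketch below works on plain dictionaries shaped like the entries in an EC2 DescribeVolumes response (`VolumeId`, `Size` in GiB, `State`), where an unattached volume reports the state `available`. The sample data is invented, and the GB-month rate is a hypothetical placeholder that should be replaced with the actual price for the volume type and region.

```python
def orphan_volumes(volumes: list[dict]) -> list[tuple[str, int]]:
    """Return (volume_id, size_gib) for volumes not attached to any instance."""
    return [(v["VolumeId"], v["Size"])
            for v in volumes
            if v.get("State") == "available"]


def monthly_orphan_cost(orphans: list[tuple[str, int]], rate_per_gb_month: float) -> float:
    """Dollars per month for the orphaned capacity at the given rate."""
    return round(sum(size for _, size in orphans) * rate_per_gb_month, 2)


# Invented sample data for illustration only.
sample = [
    {"VolumeId": "vol-aaa", "Size": 500, "State": "in-use"},
    {"VolumeId": "vol-bbb", "Size": 1000, "State": "available"},
    {"VolumeId": "vol-ccc", "Size": 200, "State": "available"},
]

orphans = orphan_volumes(sample)
print(orphans)                                   # [('vol-bbb', 1000), ('vol-ccc', 200)]
print(monthly_orphan_cost(orphans, 0.10))        # 120.0 at a hypothetical 10 cents per GB-month
```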

The open egress that backfilled a terabyte. A new partner integration shipped on a Friday. The integration backfilled a terabyte of historical data over the weekend through a NAT Gateway with no caps and no monitoring. Monday’s bill showed around 135 dollars for that single weekend: roughly 45 dollars of NAT processing plus roughly 90 dollars of egress for the terabyte. Repeat the backfill for ten partners and the line item becomes visible enough that Finance asks questions.

The Kafka cluster across three AZs. A team set up Kafka with three brokers, one per AZ, for high availability. Producers and consumers were in arbitrary AZs. Every produce, every replicate, every consume that crossed an AZ boundary cost cross-AZ traffic. After six months, the cross-AZ line item on the Kafka workload was larger than the EC2 line item for the brokers themselves. The fix (rack-awareness, partition leadership pinning, careful consumer placement) is real engineering work and has to be designed in, not bolted on.

The CloudWatch Logs DEBUG default. A new service shipped with DEBUG-level logging in production by mistake. CloudWatch Logs ingestion ran around 50 cents per gigabyte, the service logged 200 GB a day, and the line item was 100 dollars a day or 3,000 dollars a month. The service ran like that for two months before a bill review caught it. The remediation is straightforward (tighten log levels, add ingestion budgets, alert on per-service log spend). The pattern repeats.

The shared theme: each of these is invisible to the team running the workload. The cost shows up on a bill the team does not look at, in a category they do not own, attributed to a service they did not realise had the line item. Without a structure to surface the costs back to the team that caused them, nothing changes.

FinOps as a discipline

FinOps is the operational discipline that makes cloud cost a first-class concern alongside reliability and performance. The FinOps Foundation (https://www.finops.org/, retrieved 2026-05-01) is the industry body, and the FinOps Framework documents the practices a mature organisation adopts. The term itself dates to around 2017, with the Foundation forming in 2019 and reaching critical mass through 2022 to 2024.

The framework rests on three phases that loop continuously.

Inform. Get the data. Tag every resource with a team, product, and environment. Build cost dashboards that attribute spend to the team that caused it. Make the bill visible to the people who can change it.

Optimise. Use the data. Right-size compute (lesson 67), tier storage (lesson 66), buy reserved capacity for predictable baselines, kill the orphan resources, fix the egress paths. The optimisation lessons of Module 9 fit here.

Operate. Keep using the data. Cost reviews on a regular cadence. Cost goals as part of team OKRs. Anomaly alerts when a line item jumps. FinOps becomes part of the operating rhythm, not a one-off project.
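An anomaly alert on a line item does not need to be sophisticated to be useful. The sketch below flags any day whose spend exceeds a multiple of the trailing average; the window, the threshold, and the function name are illustrative assumptions rather than values taken from the FinOps Framework.

```python
def spend_anomalies(daily_spend: list[float], window: int = 7,
                    threshold: float = 1.5) -> list[tuple[int, float, float]]:
    """Return (day_index, spend, baseline) for days above threshold x trailing mean."""
    alerts = []
    for i in range(window, len(daily_spend)):
        baseline = sum(daily_spend[i - window:i]) / window
        if baseline > 0 and daily_spend[i] > threshold * baseline:
            alerts.append((i, daily_spend[i], baseline))
    return alerts


# Invented data: a log-ingestion line item at 5 dollars a day until a
# DEBUG-level deploy pushes it to 100 dollars a day.
logs_line_item = [5.0] * 8 + [100.0, 100.0]

for day, spend, baseline in spend_anomalies(logs_line_item):
    print(f"day {day}: {spend:.2f} vs trailing mean {baseline:.2f}")
```

A trailing mean absorbs the new level after a week or so, which means the alert has to reach someone while it is still firing; that is part of why the review cadence matters as much as the alert.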

The FOCUS specification (FinOps Open Cost and Usage Specification, https://focus.finops.org/, retrieved 2026-05-01) is the recent attempt at a vendor-neutral schema for billing data, so a multi-cloud organisation can analyse AWS, Azure, and GCP costs in a unified shape. As of 2026, FOCUS 1.0 is published and most major clouds export FOCUS-conformant billing data. For a team that runs on a single cloud, FOCUS is overkill. For a team that runs on two or more, it is the substrate that makes consolidated cost analysis tractable.

The dashboard everyone needs

The dashboard a mature data platform team has, and a maturing one is in the process of building, breaks the cloud bill down along three axes.

By team. Which team’s spend grew this month? Which team is the largest line item? Tagging discipline is what makes this possible: every resource carries a team tag, the cost-allocation tags are enabled in the billing console, and the dashboard groups by tag.
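Grouping by tag is simple once the tags exist; the useful detail is making untagged spend visible instead of dropping it. The sketch below assumes billing line items have already been exported into dictionaries with a `cost` field and a `tags` mapping. That shape, the field names, and the sample data are hypothetical, not the format of any particular billing export.

```python
from collections import defaultdict


def spend_by_tag(line_items: list[dict], tag_key: str = "team") -> dict[str, float]:
    """Total cost per tag value, largest first; missing tags land in UNTAGGED."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "UNTAGGED")
        totals[owner] += item["cost"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Invented sample data for illustration only.
items = [
    {"service": "EC2", "cost": 1200.0, "tags": {"team": "ingest"}},
    {"service": "S3", "cost": 300.0, "tags": {"team": "analytics"}},
    {"service": "NAT", "cost": 450.0, "tags": {}},
    {"service": "S3", "cost": 80.0, "tags": {"team": "ingest"}},
]

print(spend_by_tag(items))
# {'ingest': 1280.0, 'UNTAGGED': 450.0, 'analytics': 300.0}
```

The size of the UNTAGGED bucket is itself a measure of tagging discipline and a reasonable first number to drive down.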

By product. Within a team, which product or service is the spend concentrated in? Useful for capacity planning, useful for budget conversations with product managers, useful when a product gets retired and the cost should fall off a cliff.

By environment. Production, staging, development, sandbox. The classic finding is that staging and development together cost 30 to 50 percent of production, and most of that is workloads left running over weekends and overnight that did not need to be. Auto-shutdown of non-production environments is one of the highest-ROI cost interventions a team can implement.
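The size of the auto-shutdown saving follows directly from the schedule. The sketch below assumes a hypothetical policy of running non-production environments on weekdays from 08:00 to 20:00 local time, and it applies only to resources billed by the hour while running; storage and other always-on charges are unaffected.

```python
from datetime import datetime

HOURS_PER_WEEK = 168


def should_run(now: datetime, start_hour: int = 8, end_hour: int = 20) -> bool:
    """True on weekdays (Mon-Fri) from start_hour inclusive to end_hour exclusive."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def weekly_savings_fraction(start_hour: int = 8, end_hour: int = 20,
                            weekdays: int = 5) -> float:
    """Fraction of hourly-billed spend avoided compared with running 24x7."""
    on_hours = (end_hour - start_hour) * weekdays
    return 1 - on_hours / HOURS_PER_WEEK


print(should_run(datetime(2026, 5, 4, 10, 0)))   # Monday morning: True
print(should_run(datetime(2026, 5, 2, 10, 0)))   # Saturday morning: False
print(f"{weekly_savings_fraction():.0%}")        # 64%
```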

The dashboard does not have to be exotic. AWS Cost Explorer with cost-allocation tags enabled, a couple of saved views, and a weekly review on the team’s calendar covers most of the value. Vendor tools (CloudHealth, Apptio Cloudability, Vantage, Spot.io, the ecosystem of FinOps platforms) add depth for organisations large enough to justify the spend on the tooling itself, but the underlying discipline is what matters.

Where the rest of the module goes

This lesson set the stage. Lesson 66 takes the storage axis and goes deep: tiering across S3 storage classes, lifecycle policies, the Glacier retrieval-cost trap, file compaction for Parquet, and the automatic optimisation features in Iceberg and Delta. Lesson 67 takes the compute axis: Spot and preemptible instances, autoscaling done well versus done badly, right-sizing as a regular discipline, and reserved capacity for predictable baselines. Subsequent lessons cover query-level cost (warehouse query optimisation), architectural patterns that compound over time (the cost implications of multi-region versus single-region, of microservices versus monoliths from a cost-of-traffic standpoint), and the cultural question of how a team makes cost a shared concern.

The thread through all of it is the iceberg framing. Every cost decision has a visible part the engineer optimises and a hidden part the bill records. The team that reads the bill, attributes it back to the work that caused it, and treats cost as a real concern alongside reliability and performance is the team that gets to keep its platform when budget season comes around.

Citations and further reading

  • AWS pricing pages, https://aws.amazon.com/pricing/ (retrieved 2026-05-01). The canonical reference for compute, storage, and network pricing referenced throughout the lesson. Specific subpages for VPC (https://aws.amazon.com/vpc/pricing/), S3 (https://aws.amazon.com/s3/pricing/), and EC2 (https://aws.amazon.com/ec2/pricing/) carry the per-line-item figures.
  • FinOps Foundation, https://www.finops.org/ (retrieved 2026-05-01). The industry body for cloud financial management. The annual State of FinOps report has the rough composition figures cited above.
  • FinOps Foundation, FOCUS specification, https://focus.finops.org/ (retrieved 2026-05-01). The vendor-neutral billing schema for multi-cloud cost analysis.
  • Cloudflare, “AWS’s Egregious Egress”, https://blog.cloudflare.com/aws-egregious-egress/ (retrieved 2026-05-01). The case for why egress pricing is the most distorted line item on a typical cloud bill, and the context for why zero-egress object stores have a market.
  • J. R. Storment and Mike Fuller, “Cloud FinOps”, O’Reilly, 2nd edition (2023). The book that established FinOps as a recognised discipline; the second edition is the current reference text.