Storage cost optimization: tiering, lifecycle, compaction

Lesson 65 framed the cloud bill as an iceberg, with storage as one of the largest items below the visible waterline. This lesson takes that storage line and goes deep. Three levers carry most of the savings: tiering data across storage classes by access frequency, automating that movement with lifecycle policies, and compacting small files into larger ones so the file count stops driving request and metadata costs. A fourth lever, just deleting what is not needed, is the one that gets neglected because nobody is paid to delete things.

The lesson is grounded in S3 because S3 is the dominant object store and because AWS has the most public pricing detail. The same shapes apply to Azure Blob Storage tiers (Hot, Cool, Cold, Archive) and to Google Cloud Storage classes (Standard, Nearline, Coldline, Archive). The class names and exact prices differ; the model is the same: cheaper storage in exchange for retrieval fees, minimum storage durations, or slower access.

The 90/10 reality

A consistent finding across analytics workloads is that roughly 10 percent of the data accounts for 90 percent of the access. The freshest week of events drives most of the dashboards. The newest model artefact gets all the inference traffic. Last quarter’s invoices get queried for end-of-quarter reporting and then go quiet. The previous year’s data is mostly dormant, queried only for annual comparisons or compliance pulls.

This is the basis for tiering. Storing the dormant 90 percent on the same storage class as the hot 10 percent means paying premium rates for cold data. The economic argument for tiering is straightforward: on the list prices quoted later in this lesson, the colder classes are roughly 45 to 95 percent cheaper per gigabyte than Standard, the dormant data does not need millisecond access, and the small access tax for the rare retrieval is dwarfed by the storage savings on the bulk of the data.

The trap is that the savings depend on the access pattern matching the assumption. A workload that turns out to scan all of last year’s data once a month for a regulatory report can end up paying more under a cold-tier policy than it would on Standard, because retrieval charges and request costs eat the storage savings. The first job of a tiering strategy is to measure access, not to assume.
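A minimal sketch of what "measure access" can look like in practice. It assumes a hypothetical inventory of objects with an age, a size, and a 30-day read count (in a real bucket these would have to be derived from access logs or storage analytics; the record shape, sample numbers, and bucket boundaries here are all illustrative assumptions). It reports what share of bytes and what share of reads falls into each age bucket.

```python
from collections import defaultdict

# Hypothetical rows: (object_age_days, size_gb, reads_in_last_30_days).
# Illustrative numbers only, not measurements from a real workload.
SAMPLE_OBJECTS = [
    (3, 40.0, 900),
    (20, 60.0, 450),
    (75, 150.0, 30),
    (200, 400.0, 4),
    (500, 350.0, 0),
]

# Age buckets as (label, exclusive upper bound in days).
AGE_BUCKETS = [
    ("0-30d", 30),
    ("30-90d", 90),
    ("90-365d", 365),
    ("365d+", float("inf")),
]


def bucket_for(age_days):
    """Return the label of the first bucket whose upper bound exceeds the age."""
    for label, upper in AGE_BUCKETS:
        if age_days < upper:
            return label
    raise ValueError(f"no bucket for age {age_days}")


def access_profile(rows):
    """Return {bucket_label: (share_of_bytes, share_of_reads)}."""
    size = defaultdict(float)
    reads = defaultdict(int)
    for age_days, size_gb, read_count in rows:
        label = bucket_for(age_days)
        size[label] += size_gb
        reads[label] += read_count
    total_gb = sum(size.values()) or 1.0
    total_reads = sum(reads.values()) or 1
    return {
        label: (size[label] / total_gb, reads[label] / total_reads)
        for label, _ in AGE_BUCKETS
    }


for label, (bytes_share, read_share) in access_profile(SAMPLE_OBJECTS).items():
    print(f"{label:>8}: {bytes_share:6.1%} of bytes, {read_share:6.1%} of reads")
```

If the older buckets turn out to hold a meaningful share of reads, that is the signal to keep them on a warmer class than the default assumption would suggest.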

S3 storage classes

AWS exposes a lattice of storage classes optimised for different access patterns. Pricing is approximate and varies by region; the figures below use US East (N. Virginia, us-east-1) list prices as retrieved on 2026-05-01 (https://aws.amazon.com/s3/pricing/) for relative comparison, not absolute budgeting.

Standard. Around 2.3 cents per gigabyte per month. The default. Millisecond access, designed for 99.99 percent availability (the SLA itself commits to 99.9 percent), multi-AZ durability. The right tier for hot data.

Intelligent-Tiering. Same per-gigabyte cost as Standard for the frequent-access tier, with automatic movement to the lower-cost Infrequent Access and Archive Instant Access tiers based on observed access, and to the Archive Access and Deep Archive Access tiers if those are explicitly enabled. A small monthly monitoring charge applies per object; objects under 128 KB are not monitored and stay in the frequent-access tier. The right tier when access patterns are unpredictable or vary across objects, because S3 does the tiering work.

Standard-IA. Around 1.25 cents per gigabyte per month, with a per-GB retrieval charge of around 1 cent and a 30-day minimum storage duration. Good for data accessed monthly but still needing millisecond access.

One Zone-IA. Around 1 cent per gigabyte per month. Same retrieval and minimum-duration model as Standard-IA, but data lives in a single AZ. Acceptable for re-creatable data (intermediate analytics outputs, secondary copies of media) where the AZ-loss risk is tolerable.

Glacier Instant Retrieval. Around 0.4 cents per gigabyte per month. Millisecond retrieval, but with a higher per-GB retrieval charge (around 3 cents per GB) and a 90-day minimum. The right tier for archive data that still needs to be queryable on demand, like compliance archives that an auditor might pull.

Glacier Flexible Retrieval. Around 0.36 cents per gigabyte per month, with a 90-day minimum storage duration. Retrieval takes minutes to hours, with retrieval charges that scale with speed (Expedited, Standard, Bulk). Useful for backups and long-tail data where waiting an hour to retrieve is acceptable.

Glacier Deep Archive. Around 0.099 cents per gigabyte per month, the cheapest tier on AWS, with a 180-day minimum storage duration. Standard retrieval completes within about 12 hours and the cheaper Bulk retrieval within about 48 hours; there is no Expedited option. The right tier for data kept solely for compliance, where the question “do you have it” matters and “how fast can we get it” does not.

The pricing structure is intentional: the cheaper the per-month rate, the higher the per-retrieval cost and the longer the minimum duration. The arithmetic that determines the right tier for a given dataset is “what is the expected retrieval frequency, and at what frequency does the retrieval cost exceed the storage savings”.
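The break-even question can be sketched with the approximate per-gigabyte prices quoted above. The model is deliberately simplified and is an assumption of this sketch: it counts only storage and per-GB retrieval charges, and ignores request charges, minimum durations, and transition fees, all of which push the real break-even lower.

```python
# Approximate per-GB list prices quoted in this lesson (USD); not for budgeting.
STANDARD_PER_GB_MONTH = 0.023
COLD_CLASSES = {
    "Standard-IA": {"storage": 0.0125, "retrieval": 0.01},
    "Glacier Instant Retrieval": {"storage": 0.004, "retrieval": 0.03},
}


def monthly_cost_per_gb(storage, retrieval, full_reads_per_month):
    """Storage plus retrieval cost for one GB read `full_reads_per_month` times."""
    return storage + retrieval * full_reads_per_month


def break_even_reads_per_month(storage, retrieval, standard=STANDARD_PER_GB_MONTH):
    """Reads per month at which the cold class costs the same as Standard."""
    return (standard - storage) / retrieval


for name, price in COLD_CLASSES.items():
    reads = break_even_reads_per_month(price["storage"], price["retrieval"])
    print(f"{name}: cheaper than Standard below {reads:.2f} full reads per month")
```

Under these simplifications Standard-IA breaks even at about 1.05 full reads of the data per month and Glacier Instant Retrieval at about 0.63, which is why a dataset scanned in full every month is a poor candidate for either.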

Lifecycle policies

S3 lifecycle policies are the automation that makes tiering practical. A lifecycle rule says, in declarative form, “move objects in this prefix from Standard to Standard-IA after 30 days, to Glacier Instant Retrieval after 90 days, to Deep Archive after 365 days, and delete after 7 years”. The rule runs on its own schedule; the team does not have to write code to migrate objects between tiers.

flowchart LR
    Hot["Hot data<br/>0-30 days<br/>Standard"] --> Warm["Warm data<br/>30-90 days<br/>Standard-IA"]
    Warm --> Cold["Cold data<br/>90-365 days<br/>Glacier IR"]
    Cold --> Frozen["Archive<br/>1-7 years<br/>Glacier Deep"]
    Frozen --> Deleted["Deleted<br/>after 7 years"]

The lifecycle policy is the single highest-ROI cost intervention available on a typical S3 bill, because it requires no application changes and runs forever once configured. Most teams that have not been through a FinOps engagement have either no lifecycle policies or default policies that move objects to IA after 30 days and never further. The disciplined version walks data through three or four tiers based on access measurements, and deletes data that has aged past its retention requirement.

A common configuration that captures most of the savings on an analytics bucket:

Days 0 to 30: Standard. Hot, frequently accessed, queried by the latest dashboards.
Days 30 to 90: Standard-IA. Warm, occasionally queried for month-over-month comparisons.
Days 90 to 365: Glacier Instant Retrieval. Cold, queried for the rare quarterly or compliance pull.
Days 365 onward: Glacier Deep Archive. Frozen, kept for the legal retention period and unlikely to be retrieved.
After the retention period: deleted.

The exact thresholds depend on the workload. A team with daily SLAs on data freshness probably extends the Standard window. A team with quarterly reporting cycles probably extends the Standard-IA window so the quarterly query still hits the millisecond-access tier. The pattern is to measure first, set thresholds, and revisit annually.
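The schedule above can be written down as a lifecycle configuration. The sketch below builds it as a plain Python dictionary in the shape that boto3's put_bucket_lifecycle_configuration accepts; the rule ID, the prefix, the bucket name in the comment, and the choice of 2,555 days for "7 years" are hypothetical values chosen for illustration. The small validator is not part of any AWS API; it only checks that the thresholds are in ascending order.

```python
import json

RETENTION_DAYS = 7 * 365  # assumption: a 7-year retention approximated in days

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "analytics-tiering",        # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "events/"},  # hypothetical prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": RETENTION_DAYS},
        }
    ]
}


def thresholds_are_ordered(rule):
    """True if transition days strictly increase and expiration comes last."""
    days = [t["Days"] for t in rule["Transitions"]]
    increasing = all(a < b for a, b in zip(days, days[1:]))
    return increasing and rule["Expiration"]["Days"] > days[-1]


assert all(thresholds_are_ordered(r) for r in lifecycle_configuration["Rules"])
print(json.dumps(lifecycle_configuration, indent=2))

# With boto3 installed and credentials configured, the dict could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle_configuration,
#   )
```

Keeping the configuration as data makes the annual revisit cheap: the thresholds live in one place and can be reviewed alongside the access measurements.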

The Glacier retrieval-cost trap

The most common way a tiering strategy goes wrong is moving data to Glacier and then needing to retrieve it. Glacier Flexible Retrieval and Deep Archive both have per-GB retrieval charges and request charges, plus minimum storage durations of 90 and 180 days respectively. A team that moves a terabyte to Deep Archive and retrieves half of it a week later pays the storage at the Deep Archive rate, the per-GB retrieval charge, storage for the temporary restored copy, the per-GB egress charge if the data leaves AWS, and an early-deletion fee for the remainder of the minimum duration if the objects are then deleted or moved back to a warmer class. The total can exceed what the data would have cost on Standard for the same period.

The pattern that triggers this: a regulatory requirement comes in, the team realises the archived data needs to be produced for an audit, and the retrieval cost is discovered after the fact. The fix is to model retrieval cost into the tiering decision before configuring the policy. A useful rule of thumb: if a dataset has more than a 5 percent chance of being retrieved per year, Glacier Instant Retrieval or Standard-IA is usually a better fit than Deep Archive.
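The terabyte scenario can be worked through numerically. The storage prices are the approximate figures quoted earlier in this lesson; the Deep Archive retrieval price (2 cents per GB) and the internet egress price (9 cents per GB) are assumptions for illustration and should be checked against the current pricing page. Request charges and the temporary restored copy are left out, so the real bill would be somewhat higher.

```python
GB_PER_TB = 1024

STANDARD_PER_GB_MONTH = 0.023        # approximate, from this lesson
DEEP_ARCHIVE_PER_GB_MONTH = 0.00099  # approximate, from this lesson
DEEP_ARCHIVE_MIN_DAYS = 180
RETRIEVAL_PER_GB = 0.02              # assumption: standard-speed retrieval
EGRESS_PER_GB = 0.09                 # assumption: transfer out to the internet


def standard_cost(gb, days):
    """Cost of simply leaving the data on Standard for `days`."""
    return gb * STANDARD_PER_GB_MONTH * days / 30


def deep_archive_cost(gb, days_stored, retrieved_gb, leaves_aws, deleted_after):
    """Storage (with the minimum duration if deleted early), retrieval, egress."""
    billed_days = max(days_stored, DEEP_ARCHIVE_MIN_DAYS) if deleted_after else days_stored
    storage = gb * DEEP_ARCHIVE_PER_GB_MONTH * billed_days / 30
    retrieval = retrieved_gb * RETRIEVAL_PER_GB
    egress = retrieved_gb * EGRESS_PER_GB if leaves_aws else 0.0
    return storage + retrieval + egress


on_standard = standard_cost(GB_PER_TB, days=7)
best_case = deep_archive_cost(GB_PER_TB, 7, GB_PER_TB / 2, leaves_aws=False, deleted_after=False)
worst_case = deep_archive_cost(GB_PER_TB, 7, GB_PER_TB / 2, leaves_aws=True, deleted_after=True)

print(f"Standard for one week:            ${on_standard:.2f}")
print(f"Deep Archive, retrieval only:     ${best_case:.2f}")
print(f"Deep Archive, deleted and egress: ${worst_case:.2f}")
```

Under these assumed prices the retrieval charge alone is larger than a week of Standard storage, before early deletion or egress is counted.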

Parquet compaction

Storage-class tiering attacks the per-gigabyte side of the bill. Compaction attacks the per-request side by reducing file count. The two are related: a hot-tier dataset of a million tiny Parquet files costs nearly the same in storage as a few large ones, but the request count, the listing time, and the query overhead make the small-files version dramatically more expensive to use.

The “small files problem” arises naturally in streaming ingestion and incremental ETL. Each micro-batch writes its own Parquet file. After a year, the table consists of millions of files averaging a few hundred kilobytes each. Every query that scans the table has to list and open each file in the partitions it touches, parse its metadata, and merge the results. Spark and Trino both slow down noticeably as the file count climbs into the hundreds of thousands. Beyond a million, query times become unpredictable and request charges become a noticeable line item on the bill.

The fix is compaction: a periodic job that reads many small files, sorts and bins the rows appropriately, and writes them out as fewer larger files. The target file size for analytics workloads is typically 256 MB to 1 GB per Parquet file. Compaction trades off the cost of running the job against the savings from reducing file count and metadata overhead, and for any dataset queried daily the savings dominate within weeks.

flowchart LR
    Tiny["10,000 small Parquet files<br/>~100 KB each<br/>1 GB total"] --> Compact["Compaction job<br/>read, merge, rewrite"]
    Compact --> Large["4 large Parquet files<br/>~256 MB each<br/>1 GB total"]
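The planning step of a compaction job can be sketched as simple bin packing: group small files, in listing order, until adding the next one would exceed the target size. This is an illustrative greedy planner, not the algorithm any particular engine uses; the function name and the 256 MiB default are assumptions, and a real job would also read, merge, and rewrite the rows.

```python
def plan_compaction(file_sizes, target_bytes=256 * 1024**2):
    """Group file sizes into bins of at most `target_bytes` each.

    Files already at or above the target are left alone as single-file bins.
    Returns a list of bins, each a list of the original file sizes.
    """
    bins = []
    current, current_size = [], 0
    for size in file_sizes:
        if size >= target_bytes:
            bins.append([size])  # already large enough, do not rewrite
            continue
        if current and current_size + size > target_bytes:
            bins.append(current)  # the next file would overflow, close the bin
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


# The scenario from the diagram: 10,000 files of about 100 KB each.
small_files = [100 * 1024] * 10_000
plan = plan_compaction(small_files)
print(f"{len(small_files)} input files -> {len(plan)} output files")
for i, group in enumerate(plan):
    print(f"  output {i}: {len(group)} inputs, {sum(group) / 1024**2:.0f} MiB")
```

With these inputs the plan produces four output files, three near the 256 MiB target and one smaller remainder.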

The compaction job itself can be expensive, because it reads and rewrites the same data. The economics work because the savings are on every subsequent query, not just the next one. A daily compaction job that runs for an hour saves seconds on every query for the next year, plus reduces the per-request cost on every read.
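A rough sketch of the request side of that trade. It assumes a GET price of 0.04 cents per 1,000 requests (an assumption to check against the pricing page), a hypothetical table of one million 300 KB files scanned in full 200 times a day, and a guess that each compacted file is read with about eight range requests rather than one. Every number is illustrative; the point is the ratio, not the totals.

```python
GET_PRICE_PER_1000 = 0.0004  # assumption: USD per 1,000 GET requests on Standard


def monthly_get_cost(files, requests_per_file, scans_per_day, days=30):
    """GET charges for scanning every file `scans_per_day` times a day."""
    requests = files * requests_per_file * scans_per_day * days
    return requests / 1000 * GET_PRICE_PER_1000


# Before: one million small files, at least one GET each per scan.
before = monthly_get_cost(files=1_000_000, requests_per_file=1, scans_per_day=200)

# After: the same ~300 GB rewritten as ~1,200 files of ~256 MB,
# each read with several range requests (8 is a guess).
after = monthly_get_cost(files=1_200, requests_per_file=8, scans_per_day=200)

print(f"before compaction: ${before:,.2f} per month in GET requests")
print(f"after compaction:  ${after:,.2f} per month in GET requests")
```

The request saving is only part of the picture; the reduction in listing and file-open latency on every query is usually the larger benefit.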

Automatic optimisation in Iceberg and Delta

Lesson 37 introduced lakehouse table formats. Their automatic optimisation features are the modern answer to the small-files problem, and they fold compaction into the format itself rather than leaving it to the user.

Iceberg. The Iceberg specification supports rewrite operations as first-class concepts, and engines can run them against an Iceberg table to compact small files, sort by clustering keys, and remove rows that have been marked deleted. Spark exposes this as the rewrite_data_files procedure and Trino as ALTER TABLE ... EXECUTE optimize. Snowflake and Databricks both expose managed Iceberg with automatic optimisation as a service feature.

Delta Lake. Delta has the OPTIMIZE command, with optional ZORDER BY clauses for multi-dimensional clustering. Databricks runtime offers auto-compaction and adaptive file sizing, where the engine itself decides when to compact based on the file-size distribution it observes. The auto-compaction feature is documented in the Databricks docs (https://docs.databricks.com/en/delta/optimizations/auto-optimize.html, retrieved 2026-05-01).

Hudi. Hudi has clustering and compaction operations, with both inline (during writes) and offline (scheduled) modes. Hudi was the format most explicit about treating compaction as a first-class concern, because the merge-on-read model effectively requires it.

For a team using one of these formats, automatic optimisation should be turned on rather than off. The cost of running compaction on a schedule is small compared to the cost of querying a million tiny files for a year, and the engineering time saved by not building bespoke compaction is meaningful.
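Where automatic optimisation is not available, the same maintenance can be scheduled by hand. The sketch below only builds the SQL statements as strings; the catalog name, table name, clustering column, and 256 MiB target are hypothetical, and the statements would still need to be submitted through an engine session (for example spark.sql(...) in PySpark), which is not shown.

```python
TARGET_FILE_BYTES = 256 * 1024**2  # assumption: 256 MiB target file size


def iceberg_spark_compaction(catalog, table, target_bytes=TARGET_FILE_BYTES):
    """Spark SQL call to Iceberg's rewrite_data_files procedure."""
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"options => map('target-file-size-bytes', '{target_bytes}'))"
    )


def iceberg_trino_compaction(table):
    """Trino statement that compacts an Iceberg table."""
    return f"ALTER TABLE {table} EXECUTE optimize"


def delta_optimize(table, zorder_columns=()):
    """Delta Lake OPTIMIZE, optionally clustering with ZORDER BY."""
    statement = f"OPTIMIZE {table}"
    if zorder_columns:
        statement += " ZORDER BY (" + ", ".join(zorder_columns) + ")"
    return statement


# Hypothetical catalog, table, and column names.
for sql in (
    iceberg_spark_compaction("lake", "analytics.events"),
    iceberg_trino_compaction("analytics.events"),
    delta_optimize("analytics.events", zorder_columns=("event_date",)),
):
    print(sql)
```

Generating the statements from one place keeps the target file size consistent across tables and makes the schedule easy to audit.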

Delete what you don’t need

The cheapest gigabyte is the one not stored. Every storage discussion eventually comes back to the disciplined practice of deleting data that has aged past its retention requirement, was generated by experiments that never went anywhere, lives in development environments nobody uses, or was copied “just in case” and never read again.

The reasons this gets neglected are predictable. Deleting feels risky: somebody might need the data tomorrow, and explaining a missing file is harder than absorbing a small line item on a bill. Deletion is not a feature any team is paid to ship. The cost of the kept data is distributed across the storage line item, where no individual gigabyte is visible. And a team that has not been through a cost engagement does not have the muscle memory to ask “should this still exist”.

The mature discipline includes:

Retention policies on every dataset, set by the data owner, expressed as a lifecycle rule with a final delete action.
A cleanup pass on dev and staging environments, where snapshots, EBS volumes, S3 buckets, and database backups accumulate until somebody runs an audit.
Bucket-level reviews on a quarterly cadence, looking for buckets nobody owns and prefixes that have not been written to in years.
Versioning hygiene: S3 buckets with versioning enabled accumulate every previous version of every object, and a missing lifecycle rule on the noncurrent versions can double or triple the bill silently. The fix is a policy that expires noncurrent versions after a defined window.
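The versioning item above can be expressed as a lifecycle rule. This is a minimal sketch in the dictionary shape boto3's put_bucket_lifecycle_configuration accepts; the rule ID, the 30-day window, and the bucket name in the comment are hypothetical, and the delete-marker cleanup is an extra step added here for illustration rather than something the list above requires.

```python
import json

NONCURRENT_WINDOW_DAYS = 30  # assumption: how long old versions stay recoverable

versioning_hygiene = {
    "Rules": [
        {
            "ID": "expire-noncurrent-versions",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},            # empty prefix: the whole bucket
            # Permanently remove versions this many days after they were superseded.
            "NoncurrentVersionExpiration": {"NoncurrentDays": NONCURRENT_WINDOW_DAYS},
            # Clean up delete markers that no longer have any versions behind them.
            "Expiration": {"ExpiredObjectDeleteMarker": True},
        }
    ]
}

print(json.dumps(versioning_hygiene, indent=2))

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=versioning_hygiene,
#   )
```

Applying a lifecycle configuration replaces the bucket's existing one, so in practice a rule like this is added to the same Rules list as any tiering rules rather than applied separately.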

Storage cost is the rare engineering problem where doing nothing has a cost that compounds. The bucket grows, the bill grows, and the cost of the audit that should have happened last year accumulates as the next year of unnecessary spend.

What the next lesson covers

Lesson 67 takes the compute axis: spot and preemptible instances for the workloads that tolerate interruption, autoscaling done well versus the autoscaling pitfalls that produce thrash and missed peaks, right-sizing as the basic discipline that most teams have not done in a year, and reserved capacity for the predictable baseline that justifies a multi-year commitment. Compute is the line item with the most folklore around it, and the lesson cuts through to the levers that actually move the bill.

Citations and further reading

AWS, S3 pricing, https://aws.amazon.com/s3/pricing/ (retrieved 2026-05-01). The canonical pricing reference for all storage classes referenced above.
AWS, S3 storage classes, https://aws.amazon.com/s3/storage-classes/ (retrieved 2026-05-01). The marketing-page summary of the classes, useful for the access-pattern guidance.
AWS, S3 lifecycle configuration, https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html (retrieved 2026-05-01). The reference for lifecycle policy syntax and behaviour.
Databricks, Auto-optimize on Delta Lake, https://docs.databricks.com/en/delta/optimizations/auto-optimize.html (retrieved 2026-05-01). The documentation for auto-compaction and optimised writes in Delta.
Apache Iceberg, table maintenance documentation, https://iceberg.apache.org/docs/latest/maintenance/ (retrieved 2026-05-01). The reference for rewrite_data_files, snapshot expiration, and other maintenance operations.
Apache Hudi, compaction documentation, https://hudi.apache.org/docs/compaction/ (retrieved 2026-05-01). Reference for inline and scheduled compaction in Hudi.
FinOps Foundation, “Storage cost optimization”, https://www.finops.org/wg/storage-cost-optimization/ (retrieved 2026-05-01). Working-group resources on storage tiering practices in the FinOps community.