Compute cost optimization: spot, autoscaling, right-sizing

Lesson 65 framed cost as an iceberg, with compute as the visible part everyone watches. Lesson 66 went deep on storage. This lesson covers the compute side: the three operational levers that, for most teams, account for the majority of compute savings. Spot and preemptible instances for the workloads that can tolerate interruption. Autoscaling that responds to actual load without thrashing or missing peaks. Right-sizing the VMs that have been over-provisioned since the last capacity panic three years ago.

The lesson also touches on reserved capacity (Reserved Instances and Savings Plans on AWS, Committed Use Discounts on GCP, Reservations on Azure), which is a contractual lever rather than an architectural one but compounds with the architectural work. And it closes on the Kubernetes-specific autoscaling patterns (Cluster Autoscaler, Karpenter) that most modern teams will encounter sooner or later.

Lever one: Spot and preemptible instances

The cloud providers operate at scale, and at any given moment they have spare capacity that has not been allocated to on-demand customers. They sell that spare capacity at a steep discount, with the catch that the provider can reclaim it at short notice when on-demand demand rises.

AWS calls these Spot Instances, with savings of 60 to 90 percent off on-demand rates depending on the instance type and region, and a two-minute reclaim warning. GCP has Spot VMs (the successor to Preemptible VMs) at 60 to 91 percent off, with a 30-second warning. Azure has Spot Virtual Machines with similar pricing and a 30-second warning. The economics across the three are close enough that the choice is usually driven by what the rest of the workload runs on rather than by Spot pricing in isolation.

The headline savings (60 to 90 percent) are real, sustained, and substantial. A team running a batch ETL workload that can fit on Spot pays a fraction of the bill of a team running the same workload on on-demand. The framing question is not “should we use Spot” but “which of our workloads tolerate interruption”.

Workloads that fit Spot well.

Batch jobs that are idempotent and retryable. A nightly ETL that can be re-run after an interruption is a textbook fit.
Spark workers (executors). The driver typically runs on on-demand or a stable instance, but executors can be Spot. Spark’s task-retry mechanism handles executor loss gracefully.
Distributed ML training with checkpointing. A training job that checkpoints every few minutes loses at most a few minutes of progress when reclaimed.
Stateless web workers behind a load balancer with a healthy instance pool. The pool loses 20 percent of capacity briefly, and the autoscaler brings replacements up.
CI/CD runners. A build that gets interrupted retries on a different runner.
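
What these workloads share is cheap recovery: progress is saved often enough that a reclaim costs only the work done since the last save. A minimal sketch of that checkpoint-and-resume pattern follows. The names `run_batch` and `Interrupted` and the in-memory `checkpoint` dict are illustrative assumptions; a real job would persist the checkpoint to durable storage and react to the provider's reclaim notice rather than to a simulated exception.

```python
from typing import Callable, Dict, List, Optional


class Interrupted(Exception):
    """Stands in for a Spot reclaim arriving mid-run (simulated)."""


def run_batch(items: List[int],
              checkpoint: Dict[str, int],
              process: Callable[[int], int],
              checkpoint_every: int = 100,
              interrupt_at: Optional[int] = None) -> None:
    """Process items starting from the last checkpoint, saving progress periodically.

    `checkpoint` holds "next_index" and "total". Work done after the most
    recent save is lost on interruption and repeated on the next run.
    """
    start = checkpoint.get("next_index", 0)
    total = checkpoint.get("total", 0)
    for i in range(start, len(items)):
        if interrupt_at is not None and i == interrupt_at:
            raise Interrupted(f"reclaimed at item {i}")
        total += process(items[i])
        if (i + 1) % checkpoint_every == 0:
            # Save progress; anything after this point is at risk until the next save.
            checkpoint["next_index"] = i + 1
            checkpoint["total"] = total
    checkpoint["next_index"] = len(items)
    checkpoint["total"] = total
```

If a 1,000-item run is interrupted at item 250 with a checkpoint every 100 items, the retry resumes at item 200 and repeats 50 items of work. The checkpoint interval is the knob that trades checkpoint overhead against work lost per reclaim.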

Workloads that do not fit Spot.

Stateful single-instance services. Loss of the instance loses state.
Leader and quorum nodes in distributed systems (Kafka controllers, Spark drivers, ZooKeeper, etcd). A leader-election loss is recoverable but expensive, and Spot reclaim is too frequent to absorb it cleanly.
Long-running interactive jobs without checkpointing. Reclaim halfway through means starting over.
Latency-sensitive services with strict tail-latency SLOs. The reclaim event itself causes a latency spike during the failover.

The mature pattern is a mixed fleet: an on-demand or Reserved baseline for the workloads that need stability, with Spot for the rest. Auto Scaling Groups on AWS support this directly through “mixed instances policies”, letting the team specify a base capacity on on-demand with the rest filled from Spot pools. Karpenter (covered below) goes further and chooses Spot or on-demand capacity on a per-pod basis based on declared tolerances.
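
As a rough sketch, a mixed instances policy can be assembled as a plain dictionary in the shape the EC2 Auto Scaling API accepts for its `MixedInstancesPolicy` parameter. The helper name `mixed_instances_policy`, the launch template name, and the instance types are assumptions chosen for illustration. The block only builds the structure and makes no AWS calls.

```python
from typing import Any, Dict, List


def mixed_instances_policy(launch_template_name: str,
                           instance_types: List[str],
                           on_demand_base: int,
                           on_demand_pct_above_base: int = 0) -> Dict[str, Any]:
    """Build a MixedInstancesPolicy dict: an on-demand base, the rest from Spot."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,  # hypothetical template
                "Version": "$Latest",
            },
            # Several interchangeable types widen the set of Spot pools to draw from.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Instances that are always on-demand, regardless of fleet size.
            "OnDemandBaseCapacity": on_demand_base,
            # Share of capacity above the base that is on-demand (0 means all Spot).
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


policy = mixed_instances_policy(
    "etl-workers", ["m5.large", "m5a.large", "m6i.large"], on_demand_base=2
)
```

With boto3, a dictionary like this would be passed as the `MixedInstancesPolicy` argument when creating or updating an Auto Scaling Group. Listing several similar instance types matters because a single Spot pool can be reclaimed all at once.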

Lever two: Autoscaling

Most workloads do not have constant load. Web traffic peaks in the afternoon and dips overnight. Batch jobs run in waves. Analytical queries cluster around business hours. A static fleet sized for peak load is idle most of the time, paying for capacity it does not use.

Autoscaling adjusts the fleet size to match the observed load. Done right, the team pays for what it uses and nothing more. Done badly, autoscaling produces a worse bill than a static fleet, because it thrashes between scale-up and scale-down operations, fails to respond fast enough to traffic spikes (causing dropped requests and SLO violations), or scales prematurely on noisy signals.

The mechanics are straightforward in concept. Define a metric the autoscaler watches (CPU utilisation, request rate, queue depth). Define thresholds at which it adds or removes capacity. The autoscaling controller checks the metric on an interval and adjusts the fleet size.

The pitfalls are subtler than the mechanics suggest.

Oscillation. Scale up at 70 percent CPU, scale down at 50 percent. On a two-instance fleet, add an instance at 70 and the average drops to about 47, the controller scales down, the average climbs back to 70, the controller scales up again. The fleet ping-pongs between sizes, paying for instance start-up and shutdown overhead repeatedly. The fix is a hysteresis gap wide enough that the utilisation after a scale-up still sits above the scale-down threshold (scale up at 70, scale down at 40) and a cooldown period (after any scaling action, wait 10 minutes before considering another).
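
The effect of the gap and the cooldown can be seen in a toy simulation. This is a sketch under simplifying assumptions: load is constant, it spreads evenly across instances, and new capacity is available instantly. The function name `simulate` and the numbers are illustrative, not taken from any real controller.

```python
def simulate(load: float, start: int, up: float, down: float,
             cooldown_ticks: int, ticks: int) -> int:
    """Return the number of scaling actions over `ticks` controller intervals.

    `load` is total demand in percent-of-one-instance units, so the average
    utilisation of a fleet of n instances is load / n.
    """
    n = start
    actions = 0
    wait = 0
    for _ in range(ticks):
        if wait > 0:
            wait -= 1          # still cooling down after the last action
            continue
        util = load / n
        if util >= up:
            n += 1
        elif util < down and n > 1:
            n -= 1
        else:
            continue           # inside the hysteresis band: do nothing
        actions += 1
        wait = cooldown_ticks
    return actions


# Two instances running at 70 percent, observed for 60 one-minute intervals.
narrow_gap = simulate(load=140, start=2, up=70, down=50, cooldown_ticks=0, ticks=60)
with_cooldown = simulate(load=140, start=2, up=70, down=50, cooldown_ticks=10, ticks=60)
wide_gap = simulate(load=140, start=2, up=70, down=40, cooldown_ticks=0, ticks=60)
```

In this setup the narrow gap produces a scaling action on every interval, the cooldown alone cuts that to six actions in the hour, and the wider gap settles after a single scale-up. The cooldown limits the damage; the gap removes the cause.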

Slow scale-up. A traffic spike hits, the metric takes a minute or two to register before the autoscaler decides to add capacity, AWS provisions an instance over a couple of minutes, the instance boots and runs its initialisation script over another minute, the load balancer marks it healthy after a few more seconds, and only then does it start serving traffic. Total time from spike to capacity: around five minutes. If the spike is shorter than five minutes, the autoscaler is closing the barn door after the horse has left. The fix is overcapacity headroom (target 60 percent rather than 80 percent), warm pools (pre-initialised instances ready to serve), and predictive scaling that looks at historical patterns and scales preemptively.
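
The headroom argument reduces to simple arithmetic. Assuming utilisation grows in proportion to load (a simplification), a fleet running at target utilisation u can absorb a spike of up to 1/u times normal load before saturating, which is what has to carry it until new capacity arrives. The function names below are illustrative.

```python
def absorbable_spike(target_utilisation: float) -> float:
    """Largest load multiplier the current fleet can take before hitting 100 percent."""
    return 1.0 / target_utilisation


def saturates(target_utilisation: float, spike_multiplier: float) -> bool:
    """True if the spike pushes the existing fleet past 100 percent utilisation."""
    return target_utilisation * spike_multiplier > 1.0


# A 1.5x spike against a fleet tuned to 80 percent versus one tuned to 60 percent.
tight_fleet_saturates = saturates(0.80, 1.5)   # 120 percent of capacity
loose_fleet_saturates = saturates(0.60, 1.5)   # 90 percent of capacity
```

The fleet at 80 percent is overloaded for the whole provisioning delay, while the fleet at 60 percent rides out the same spike on existing capacity. The price of that safety is the extra idle capacity carried the rest of the time.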

Scale-down before load drops. The reverse failure: load is dropping, the autoscaler scales down based on the immediate signal, then load comes back and the team is short on capacity. The fix is asymmetric scaling: scale up aggressively, scale down slowly. A typical configuration scales up within 60 seconds of crossing the upper threshold and waits 10 to 15 minutes of sustained low utilisation before scaling down.
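
A minimal sketch of an asymmetric rule: one sample above the upper threshold triggers a scale-up, while a scale-down needs an unbroken run of low samples. The class name `AsymmetricScaler` and its parameters are assumptions for illustration; managed autoscalers expose the same idea through their own settings.

```python
class AsymmetricScaler:
    """Scale up on the first high sample; scale down only after sustained low load."""

    def __init__(self, up: float = 70.0, down: float = 40.0,
                 low_samples_required: int = 10, min_size: int = 1) -> None:
        self.up = up
        self.down = down
        self.low_samples_required = low_samples_required
        self.min_size = min_size
        self._low_streak = 0

    def decide(self, utilisation: float, size: int) -> int:
        """Return the new fleet size given the latest utilisation sample."""
        if utilisation >= self.up:
            self._low_streak = 0
            return size + 1                    # react immediately to high load
        if utilisation < self.down:
            self._low_streak += 1
            if self._low_streak >= self.low_samples_required and size > self.min_size:
                self._low_streak = 0
                return size - 1                # low load has been sustained
            return size
        self._low_streak = 0                   # back in the normal band: reset the streak
        return size
```

With one sample per minute, `low_samples_required=10` corresponds to waiting for 10 minutes of sustained low utilisation. Any sample back in the normal band resets the count, so a brief dip never removes capacity.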

Cold-start latency. Even when capacity is available, a freshly started container or VM is slow on its first few requests because the JIT compiler, caches, and connection pools are not warmed. The team that ignores cold-start sees tail latency spike every time the autoscaler adds capacity. The fix is application-level (warm-up endpoints, pre-warmed connection pools) plus platform-level (gradual traffic ramping for new instances).

Thrashing on noisy metrics. CPU utilisation is noisy on small instances and bursty workloads. Scaling decisions made on the 10-second average produce more thrash than decisions made on the 5-minute average. The fix is metric smoothing (use 5-minute averages, weight recent observations slightly higher) and, for workloads where CPU is too noisy, scaling on more stable signals (queue depth, request rate).
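
To illustrate the smoothing point, the sketch below counts how often a raw signal and its moving average cross a scale-up threshold. The sample values are synthetic and the helper names are illustrative. It uses a plain moving average; an exponentially weighted one would give recent samples more weight.

```python
from collections import deque
from typing import List


def moving_average(samples: List[float], window: int) -> List[float]:
    """Trailing moving average; early values average over the samples seen so far."""
    buf: deque = deque(maxlen=window)
    out = []
    for s in samples:
        buf.append(s)
        out.append(sum(buf) / len(buf))
    return out


def count_breaches(series: List[float], threshold: float) -> int:
    """Number of samples at or above the threshold."""
    return sum(1 for v in series if v >= threshold)


# Synthetic noisy CPU signal: true average 70 percent, swinging between 55 and 85.
raw = [55.0, 85.0] * 15
raw_breaches = count_breaches(raw, 75.0)
smoothed_breaches = count_breaches(moving_average(raw, 6), 75.0)
```

Against a 75 percent threshold the raw signal breaches on every other sample, while the six-sample average never does, even though the underlying load is identical. The cost of smoothing is a slower reaction to genuine spikes, which is why it is paired with headroom.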

The shape of a well-tuned autoscaling configuration: target utilisation around 60 to 70 percent, asymmetric scaling rules, hysteresis between up and down thresholds, a cooldown period, and predictive scaling layered on top for workloads with strong daily patterns. Done right, autoscaling can cut a fleet bill by 40 to 60 percent compared to peak-sized static provisioning.

Lever three: Right-sizing

Most VMs are oversized. The instance type was chosen during the initial deployment based on guesses about load, padded for safety, and never revisited. The application now uses 15 percent of the CPU and 30 percent of the memory on average, and the team is paying for 100 percent of the instance.

The exercise is straightforward and ought to be a quarterly discipline.

flowchart LR
    Measure["Measure utilisation<br/>CPU, memory, disk, network<br/>over a representative week"] --> Recommend["Pick instance type<br/>with 30-40 percent headroom<br/>over observed peak"]
    Recommend --> Test["Test in staging<br/>under production-like load"]
    Test --> Deploy["Roll out gradually<br/>monitor for regressions"]
    Deploy --> Repeat["Revisit quarterly"]
    Repeat --> Measure

The measurement window has to capture the workload’s natural cycle. A web service with weekly patterns needs a week of data. A batch job with monthly cycles needs a month. The peak observed during the window, plus a 30 to 40 percent safety margin, defines the smallest instance type that fits.

The 30 to 40 percent headroom is not arbitrary. It accounts for measurement uncertainty, future growth, and the unpredictable spikes that the measurement window did not happen to capture. A team that right-sizes to 5 percent headroom will spend the next quarter firefighting the instances that hit the ceiling on a busy day. A team that right-sizes to 70 percent headroom is paying for capacity it will not use.
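
The selection step can be sketched as a small function: scale the observed peak by the headroom, then take the cheapest instance type that still fits. The catalogue entries, names, and prices below are hypothetical placeholders, not real instance types or rates.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: float
    memory_gib: float
    hourly_usd: float


def pick_instance(peak_vcpus: float, peak_memory_gib: float,
                  catalogue: List[InstanceType],
                  headroom: float = 0.35) -> Optional[InstanceType]:
    """Cheapest type whose CPU and memory cover observed peak plus headroom."""
    need_cpu = peak_vcpus * (1 + headroom)
    need_mem = peak_memory_gib * (1 + headroom)
    fits = [t for t in catalogue
            if t.vcpus >= need_cpu and t.memory_gib >= need_mem]
    return min(fits, key=lambda t: t.hourly_usd, default=None)


# Hypothetical catalogue; names and prices are made up for the example.
CATALOGUE = [
    InstanceType("small", 2, 8, 0.10),
    InstanceType("medium", 4, 16, 0.20),
    InstanceType("large", 8, 32, 0.40),
    InstanceType("xlarge", 16, 64, 0.80),
]

# A workload on "xlarge" whose observed weekly peak is 2.4 vCPUs and 9 GiB.
choice = pick_instance(2.4, 9.0, CATALOGUE)
```

With 35 percent headroom the requirement becomes 3.24 vCPUs and 12.15 GiB, so the hypothetical `medium` type fits at a quarter of the `xlarge` price. The function returns `None` when nothing in the catalogue is large enough, which is the signal to look at a different family.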

Both AWS and GCP ship right-sizing recommenders that automate the measurement and the recommendation. AWS Compute Optimizer (https://aws.amazon.com/compute-optimizer/, retrieved 2026-05-01) analyses CloudWatch metrics for EC2, EBS, Lambda, ECS, and RDS, and produces recommendations for smaller, larger, or different-family instance types. GCP Recommender (https://cloud.google.com/recommender, retrieved 2026-05-01) does the equivalent for Compute Engine, with similar coverage on the rest of the GCP fleet. Azure has Azure Advisor with a cost-optimisation category. None of the three is a substitute for engineering judgement, but they remove most of the data-collection work and give the team a starting point for the discussion.

Right-sizing is the rare cost intervention with no architectural cost. The application does not change. The instance is replaced with a smaller one of the same family, the workload runs the same code, the user-visible behaviour is identical, and the bill drops. The reason teams do not right-size more often is not that it is hard; it is that there is rarely a forcing function. A quarterly cost review with right-sizing as a standing agenda item is the practice that makes it routine.

Reserved capacity and committed use discounts

Reserved Instances (AWS) and Savings Plans (AWS), Committed Use Discounts (GCP), and Reservations (Azure) are the contractual instruments that exchange a one-year or three-year commitment for a discount of 30 to 60 percent off on-demand rates. The mechanics differ across the providers; the framing is consistent.

The basic argument: a workload that is going to run continuously for the next one to three years anyway can be committed to in advance at a steep discount, whether or not any of it is paid upfront. The risk is that the workload changes (gets retired, migrates to a different instance type, moves to a different region) before the commitment matures, leaving the team paying for capacity it cannot use.

The discipline is to measure the predictable baseline first. The 30-day rolling minimum of compute usage, in instance-hours, is a reasonable proxy for “the capacity the team uses no matter what”. Reserve up to that baseline. Leave the variable portion above it on on-demand or Spot. The classic mistakes are over-reserving (locking in 80 percent of current usage, then discovering current usage included a deprecated workload that gets retired, leaving the reservation orphaned) and under-reserving (locking in 20 percent and paying full price on the predictable portion above).
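
The baseline measurement can be sketched in a few lines. The function name and the usage series are illustrative assumptions; in practice the daily instance-hours would come from the provider's billing or usage export.

```python
from typing import List


def reservable_baseline(daily_instance_hours: List[float], window: int = 30) -> float:
    """Always-on instance count implied by the lowest day in the trailing window."""
    if len(daily_instance_hours) < window:
        raise ValueError(f"need at least {window} days of usage data")
    lowest_day = min(daily_instance_hours[-window:])
    return lowest_day / 24.0   # instance-hours per day -> concurrent instances


# Synthetic usage: 600 instance-hours on the quietest weekday, rising through the week.
usage = [600.0 + 10.0 * (day % 7) for day in range(60)]
baseline_instances = reservable_baseline(usage)
```

The quietest day in the window has 600 instance-hours, which works out to 25 instances running around the clock. That figure is the ceiling for the commitment. Before committing, it is worth checking that none of the baseline belongs to a workload already scheduled for retirement.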

AWS Savings Plans, introduced in 2019, made the calculus easier than the older Reserved Instance model: a Savings Plan commits to a dollar-per-hour spend rather than to a specific instance type, and with a Compute Savings Plan the discount applies flexibly across instance families, sizes, and regions. As of 2026, Savings Plans are usually the right starting point for a team new to reserved capacity on AWS.

The combined pattern: Reserved or committed capacity for the predictable baseline, on-demand for the predictable peaks above the baseline, Spot for the interruptible workloads. A typical mature data platform has all three layers running together, with the proportions tuned to the workload mix.
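
A back-of-the-envelope comparison shows how the layers add up. Every number here is an assumption picked for round arithmetic (a 1.00 USD on-demand hourly rate, a 40 percent commitment discount, a 70 percent Spot discount, 730 hours in a month), not a quoted price.

```python
HOURS_PER_MONTH = 730


def unoptimised_monthly_cost(rate: float, baseline: float, peak: float,
                             interruptible: float) -> float:
    """Everything on on-demand at the full hourly rate."""
    return (baseline + peak + interruptible) * rate * HOURS_PER_MONTH


def layered_monthly_cost(rate: float, baseline: float, peak: float,
                         interruptible: float, commit_discount: float,
                         spot_discount: float) -> float:
    """Committed baseline, on-demand peak, Spot for interruptible work."""
    committed = baseline * rate * (1 - commit_discount)
    on_demand = peak * rate
    spot = interruptible * rate * (1 - spot_discount)
    return (committed + on_demand + spot) * HOURS_PER_MONTH


# Average instance counts per layer (illustrative): 20 baseline, 5 peak, 10 interruptible.
before = unoptimised_monthly_cost(1.00, 20, 5, 10)
after = layered_monthly_cost(1.00, 20, 5, 10, commit_discount=0.40, spot_discount=0.70)
savings_fraction = 1 - after / before
```

Under these assumptions the monthly bill falls from 25,550 USD to 14,600 USD, roughly 43 percent, before any autoscaling or right-sizing gains. The proportions between the layers drive the result more than the exact discount on any one of them.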

Kubernetes-specific autoscaling

For teams running on Kubernetes, the autoscaling story has two layers. Pod-level autoscaling (Horizontal Pod Autoscaler, Vertical Pod Autoscaler) decides how many pod replicas to run and how much CPU and memory each pod requests. Cluster-level autoscaling decides how many underlying VMs to run to host those pods.
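
The Horizontal Pod Autoscaler's core calculation is documented as the current replica count multiplied by the ratio of observed to target metric, rounded up, with no change while the ratio stays within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that formula. The function name is illustrative, and the real controller also applies stabilisation windows, readiness handling, and replica bounds.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """desired = ceil(current * currentMetric / targetMetric), skipped inside the tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas        # close enough to target: leave the count alone
    return math.ceil(current_replicas * ratio)


# Four replicas averaging 90 percent CPU against a 60 percent target.
scaled_up = desired_replicas(4, 90.0, 60.0)
```

Four replicas at 90 percent against a 60 percent target become six. The tolerance band plays the same role as the hysteresis gap described earlier: small deviations from target do not trigger a change.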

The Cluster Autoscaler has been the default cluster-level autoscaler since the early Kubernetes years. It watches for unschedulable pods and adds nodes from configured node pools, and it watches for underutilised nodes and removes them. It works well, but it is constrained by the node-pool model: each pool is a fixed instance type, and the autoscaler can only choose how many of that type to run.

Karpenter, introduced by AWS in 2021 and now a CNCF project, is the modern alternative. Instead of fixed node pools, Karpenter looks at the pending pods, computes the best-fit instance type for each batch, and provisions it directly. The result is denser packing, faster scale-up (Karpenter can launch nodes in around a minute compared to several minutes for the Cluster Autoscaler), and lower cost because the instance choice is workload-aware. Karpenter also handles Spot integration natively, with declared tolerances that let pods opt in to Spot pools.
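
The cost advantage of workload-aware instance choice can be illustrated with a toy comparison. This is not Karpenter's actual algorithm: it sums the pending pods' requests and picks the cheapest single node that fits, ignoring per-pod bin packing, daemon overhead, and availability. The node names, sizes, and prices are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class NodeType:
    name: str
    vcpus: float
    memory_gib: float
    hourly_usd: float


Pod = Tuple[float, float]  # (cpu request in vCPUs, memory request in GiB)


def cheapest_fit(pods: List[Pod], catalogue: List[NodeType]) -> Optional[NodeType]:
    """Cheapest single node whose capacity covers the summed pod requests."""
    cpu = sum(p[0] for p in pods)
    mem = sum(p[1] for p in pods)
    fits = [t for t in catalogue if t.vcpus >= cpu and t.memory_gib >= mem]
    return min(fits, key=lambda t: t.hourly_usd, default=None)


def fixed_pool_cost(pods: List[Pod], node: NodeType) -> float:
    """Hourly cost when the only choice is how many nodes of one fixed type to run."""
    cpu = sum(p[0] for p in pods)
    mem = sum(p[1] for p in pods)
    count = max(math.ceil(cpu / node.vcpus), math.ceil(mem / node.memory_gib))
    return count * node.hourly_usd


# Hypothetical node types; the pending pods are memory-heavy.
GENERAL = NodeType("general-4x16", 4, 16, 0.20)
MEMORY = NodeType("memory-4x32", 4, 32, 0.25)
pending = [(0.5, 6.0), (0.5, 6.0), (0.5, 6.0)]

pool_cost = fixed_pool_cost(pending, GENERAL)
best = cheapest_fit(pending, [GENERAL, MEMORY])
```

The three pods need 18 GiB in total, so a pool fixed to the general-purpose type has to run two nodes, while a memory-heavy type covers the batch with one node at a lower hourly cost. The saving comes entirely from being allowed to pick the shape of the node after seeing the pods.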

For a team on EKS in 2026, Karpenter is the recommended choice. The Cluster Autoscaler still works and is still appropriate for teams already running it stably. Greenfield clusters should default to Karpenter. The Karpenter documentation (https://karpenter.sh/, retrieved 2026-05-01) covers the setup and the patterns.

Putting it together

The combined cost picture for a mature data team running on AWS in 2026:

Predictable baseline: Reserved Instances or Savings Plans for 30 to 60 percent off on-demand.
Variable peak: on-demand, sized through autoscaling that targets 60 to 70 percent utilisation.
Interruptible workloads: Spot, integrated through mixed-instances policies or Karpenter.
Right-sizing: quarterly review with Compute Optimizer recommendations as a starting point.

A team that runs all four levers well typically captures 50 to 70 percent savings against an unoptimised compute bill. A team that runs none of them is leaving most of the available savings on the table. Most teams sit somewhere in between, with one or two levers in play and the others on the backlog.

The thread that ties Module 9’s first three lessons together: cost is not a one-time project. It is a discipline. Lesson 65 framed the iceberg. Lesson 66 covered the storage levers. This lesson covered compute. The next lessons in the module move into query-level optimisation, architectural cost patterns (multi-region versus single-region, microservices traffic costs, serverless versus provisioned trade-offs), and the cultural practices that keep cost a shared concern alongside reliability and performance.

Citations and further reading

AWS, EC2 Spot Instances, https://aws.amazon.com/ec2/spot/ (retrieved 2026-05-01). Pricing model, interruption mechanics, and the integration patterns with Auto Scaling Groups.
GCP, Spot VMs, https://cloud.google.com/compute/docs/instances/spot (retrieved 2026-05-01). Pricing and preemption mechanics on GCP.
AWS Compute Optimizer, https://aws.amazon.com/compute-optimizer/ (retrieved 2026-05-01). Right-sizing recommendations across EC2, EBS, Lambda, ECS, and RDS.
GCP Recommender, https://cloud.google.com/recommender (retrieved 2026-05-01). The equivalent recommendation service on GCP.
AWS, Savings Plans, https://aws.amazon.com/savingsplans/ (retrieved 2026-05-01). The flexible commitment-based discount mechanism.
Karpenter, https://karpenter.sh/ (retrieved 2026-05-01). The modern Kubernetes node autoscaler, with documentation on installation, NodePool configuration, and Spot integration.
Kubernetes Autoscaler project, https://github.com/kubernetes/autoscaler (retrieved 2026-05-01). The Cluster Autoscaler and Vertical Pod Autoscaler reference implementations.
FinOps Foundation, “Rate optimisation” working group, https://www.finops.org/ (retrieved 2026-05-01). Community resources on Spot, reserved capacity, and right-sizing as part of the broader FinOps practice.