Learn how auto-scaling dynamically adjusts compute resources based on real-time demand, reducing costs during quiet periods and maintaining performance during traffic spikes.
Auto-scaling is the ability of a system to automatically adjust the number of active compute resources based on current demand. Elasticity is the closely related property of a system that can both scale out (add resources) and scale in (remove resources) in response to workload changes. Together, these capabilities form the foundation of cost-efficient, high-performance cloud architectures.
Without auto-scaling, teams must provision infrastructure for peak load at all times. If your application receives 10x traffic during business hours compared to overnight, you either waste money running 10x capacity around the clock, or you risk degraded performance during peaks. Auto-scaling solves this by continuously monitoring key metrics -- CPU utilization, request queue depth, response latency, or custom application metrics -- and adjusting the number of running instances accordingly.
Modern auto-scaling systems operate on policies that define scaling triggers, cooldown periods, and boundaries. A typical policy might state: 'When average CPU utilization across the group exceeds 70% for 2 consecutive minutes, add 2 instances. When it drops below 30% for 5 minutes, remove 1 instance. Never scale below 2 instances or above 50 instances.' The asymmetry in this policy (a short window and larger step for scale-out, a longer window and smaller step for scale-in) damps oscillation, in which the system rapidly adds and removes instances in response to fluctuating load. Cooldown periods, which pause further scaling actions for a set time after each change, serve the same purpose.
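The example policy above can be sketched as a small evaluator that is fed one average-CPU sample per minute. This is a minimal illustration, not any cloud provider's API: the class name `StepScalingPolicy`, its parameters, and the choice to reset the sample window after each scaling action are all assumptions made for the sketch.

```python
from collections import deque


class StepScalingPolicy:
    """Hypothetical threshold policy; defaults mirror the example policy above."""

    def __init__(self, out_threshold=70.0, out_minutes=2, out_step=2,
                 in_threshold=30.0, in_minutes=5, in_step=1,
                 min_instances=2, max_instances=50):
        self.out_threshold = out_threshold
        self.out_minutes = out_minutes
        self.out_step = out_step
        self.in_threshold = in_threshold
        self.in_minutes = in_minutes
        self.in_step = in_step
        self.min_instances = min_instances
        self.max_instances = max_instances
        # Keep only as many per-minute samples as the longest window needs.
        self.history = deque(maxlen=max(out_minutes, in_minutes))

    @staticmethod
    def _sustained(samples, minutes, breached):
        """True if the last `minutes` samples all satisfy `breached`."""
        window = samples[-minutes:]
        return len(window) == minutes and all(breached(s) for s in window)

    def observe(self, avg_cpu, current):
        """Record one per-minute CPU sample and return the new instance count."""
        self.history.append(avg_cpu)
        samples = list(self.history)
        desired = current
        if self._sustained(samples, self.out_minutes,
                           lambda s: s > self.out_threshold):
            desired = current + self.out_step
        elif self._sustained(samples, self.in_minutes,
                             lambda s: s < self.in_threshold):
            desired = current - self.in_step
        # Enforce the group's boundaries.
        desired = max(self.min_instances, min(self.max_instances, desired))
        if desired != current:
            # Assumption: start a fresh window after acting, as a simple cooldown.
            self.history.clear()
        return desired
```

Starting from 4 instances, two consecutive samples above 70% move the group to 6, while it takes five consecutive samples below 30% to remove a single instance.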
Predictive auto-scaling extends this concept by using historical traffic patterns and machine learning to anticipate demand before it arrives. If your application consistently sees a traffic spike at 9 AM on weekdays, predictive scaling begins adding instances at 8:45 AM rather than waiting for the spike to trigger reactive thresholds. This avoids much of the cold-start latency that can otherwise degrade performance during the first minutes of a traffic surge.
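The idea of scaling ahead of a known pattern can be sketched with a very simple forecast: the average historical load for the same weekday and hour, looked up a fixed lead time in advance. Real predictive scalers use richer models; the function names, the `history` layout, and the `rps_per_instance` capacity figure below are assumptions for illustration only.

```python
import math
from datetime import datetime, timedelta
from statistics import mean


def forecast_rps(history: dict[tuple[int, int], list[float]],
                 when: datetime) -> float:
    """Average historical requests/sec for this weekday and hour (0.0 if unseen)."""
    samples = history.get((when.weekday(), when.hour), [])
    return mean(samples) if samples else 0.0


def predictive_instances(history: dict[tuple[int, int], list[float]],
                         now: datetime, lead: timedelta,
                         rps_per_instance: float,
                         min_instances: int = 2) -> int:
    """Instances to run now so that capacity is ready `lead` from now."""
    expected = forecast_rps(history, now + lead)
    return max(min_instances, math.ceil(expected / rps_per_instance))


# Hypothetical data: Mondays (weekday 0) at 9 AM have averaged about 1000 req/s.
history = {(0, 9): [900.0, 1100.0]}
monday_0845 = datetime(2024, 1, 1, 8, 45)  # 1 January 2024 was a Monday
print(predictive_instances(history, monday_0845, timedelta(minutes=15), 100.0))
```

With a 15-minute lead, the 8:45 AM evaluation already sizes the group for the 9 AM load, which matches the behavior described above.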
The Elevator Bank Analogy
Imagine a building with 10 elevators, but only 3 are active during normal hours. During the morning rush (8-9 AM) and evening rush (5-6 PM), sensors detect long wait times and automatically activate more elevators. As the rush subsides, elevators that have been idle for several minutes are deactivated to save energy. On special event days, the building manager schedules all 10 elevators to be active from the start (predictive scaling). The building never runs all 10 elevators 24/7 (over-provisioning) or only 3 during rush hour (under-provisioning).
Amazon (AWS)
Amazon pioneered auto-scaling for both its own e-commerce platform and as a cloud service. During Prime Day, Amazon auto-scales its infrastructure by orders of magnitude, spinning up thousands of additional EC2 instances, expanding DynamoDB capacity, and pre-warming CloudFront CDN edges. Their auto-scaling policies combine predictive models trained on previous Prime Day data with reactive triggers for unexpected demand patterns.
Uber
Uber uses auto-scaling to handle the dramatic demand swings in ride-hailing. Weekend nights and major events can see 5-10x the traffic of a quiet weekday afternoon. Their microservices auto-scale independently -- the matching service might need 4x more instances during peak while the receipt-generation service only needs 2x. Custom metrics like ride-request queue depth drive scaling decisions rather than generic CPU metrics.
Spotify
Spotify auto-scales its backend services using Kubernetes Horizontal Pod Autoscaler (HPA). When a major album drops (like a new Taylor Swift release), traffic to the streaming, search, and recommendation services spikes within minutes. Their auto-scaling policies use a combination of request-per-second metrics and p99 latency thresholds, with aggressive scale-out and gradual scale-in to handle the asymmetric traffic pattern.
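For reference, the core HPA calculation documented by Kubernetes is `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below implements only that formula; the function name is hypothetical, and min/max replica bounds, pod readiness handling, and stabilization windows are deliberately left out.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, tolerance: float = 0.1) -> int:
    """Replica count from the documented HPA ratio formula (simplified)."""
    ratio = current_metric / target_metric
    # Within tolerance of the target, HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Multiply before dividing to limit floating-point error ahead of ceil().
    return math.ceil(current_replicas * current_metric / target_metric)


# 4 pods averaging 90% CPU against a 60% target: ceil(4 * 90 / 60) = 6 pods.
print(hpa_desired_replicas(4, 90, 60))
```

Because the result is rounded up, HPA leans toward slightly more capacity than the ratio strictly requires.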
| Aspect | Description |
|---|---|
| Responsiveness vs Stability | Aggressive auto-scaling thresholds respond quickly to demand changes but risk oscillation (rapid scale-out/scale-in cycles). Conservative thresholds are more stable but may result in degraded performance during sudden spikes while waiting for scaling to trigger. |
| Cost vs Performance | Auto-scaling reduces costs compared to static provisioning, but maintaining a minimum capacity floor for instant responsiveness costs more than scaling to zero. A reasonable floor is the capacity needed to serve incoming traffic while newly launched instances boot and warm up. |
| Cold Start Latency | New instances take time to boot, load application code, warm caches, and establish database connections. During this warm-up period, they cannot handle full load. Pre-warming strategies mitigate this but add complexity to deployment and scaling workflows. |
| Predictive Accuracy | Predictive auto-scaling works well for regular patterns (daily cycles, weekly trends) but fails for unpredictable events (viral content, breaking news, flash sales). A hybrid approach using both predictive and reactive scaling provides the best coverage. |
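One simple way to combine the two approaches from the last row is to let each produce its own recommendation and run whichever is larger, within the group's bounds. This is an assumed combination rule for illustration, and the function and parameter names are hypothetical.

```python
def hybrid_desired(reactive: int, predictive: int,
                   min_instances: int, max_instances: int) -> int:
    """Take the larger recommendation, then clamp it to the group's bounds."""
    return max(min_instances, min(max_instances, max(reactive, predictive)))


# The predictive forecast asks for 10 instances; current load only needs 4.
print(hybrid_desired(reactive=4, predictive=10, min_instances=2, max_instances=50))
```

Taking the maximum means a missed forecast costs nothing beyond what reactive scaling would do anyway, while an accurate forecast has capacity ready before the load arrives.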
Spotify's Auto-Scaling for Album Launches
Scenario
When a major artist releases a new album on Spotify, the platform experiences a traffic surge that can reach 3-5x normal peak load within minutes. Before implementing auto-scaling, Spotify engineers had to manually pre-provision extra capacity hours before a known release, which was wasteful for smaller releases and insufficient for unexpectedly popular ones. Unanticipated releases that went viral caused service degradation because nothing had been pre-provisioned for them.
Solution
Spotify implemented a multi-layered auto-scaling strategy on Kubernetes. The base layer uses HPA with request-per-second and latency metrics for reactive scaling. A second layer uses predictive scaling based on calendar events (known release dates) and social media signal analysis to pre-provision capacity. A third layer provides burst capacity through cloud provider spot/preemptible instances that can be acquired cheaply for short-duration traffic spikes. All three layers operate simultaneously, with the predictive layer handling expected surges and the reactive layer catching unexpected ones.
Outcome
The auto-scaling system reduced infrastructure costs by approximately 45% compared to static peak provisioning while maintaining p99 latency SLOs during all but the most extreme traffic events. Engineers no longer needed to manually intervene for release-day scaling, and the system handled unexpected viral moments (like a song trending on TikTok) without advance preparation. The cold-start problem was mitigated by maintaining a warm pool of pre-initialized containers ready to accept traffic within seconds.
1. A Kubernetes HPA is configured with a target CPU utilization of 50%. Current CPU is at 80% across 3 pods. How many pods will HPA scale to (assuming no max limit)?
2. Your auto-scaling group oscillates between 4 and 12 instances every few minutes during steady traffic. What is the most likely cause?