
Cost Questions in System Design Interviews

Modern system design interviews now include cost estimation. Learn cloud pricing models, the cost estimation framework (identify the driver, estimate monthly cost, optimize), and why 'cost per request' is the key metric for microservices.

Overview

Cost awareness has become a standard expectation in system design interviews, especially at the senior level and above. The era of 'scale at any cost' has given way to an industry-wide focus on efficient infrastructure spending, driven by the FinOps movement, high-profile cost optimization projects at major tech companies, and the reality that cloud bills are often the second-largest operational expense after headcount. Interviewers who ask cost questions are evaluating whether you can design systems that are not only functionally correct and performant but also economically sustainable.

Cloud cost categories break down into four major areas. Compute covers the cost of running code -- EC2 instances (priced by instance type and hours), Lambda functions (priced per invocation and GB-seconds of execution), and container services like ECS/EKS (priced by underlying compute). Storage covers data at rest -- S3 (priced per GB stored and per request, with tiered pricing for infrequent access and archival), EBS volumes (priced per GB provisioned), and database storage (priced per GB with additional IOPS charges for provisioned-performance databases). Network covers data in motion -- egress charges (data leaving AWS is approximately $0.09/GB, while data entering is free), cross-AZ transfer ($0.01/GB each way), and cross-region transfer ($0.02/GB). Managed services add a premium above raw compute and storage -- RDS costs 30-50% more than self-hosted MySQL on the same hardware, but eliminates DBA overhead; DynamoDB on-demand mode is more expensive per request than provisioned mode but handles traffic spikes without capacity planning.
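The network line items above lend themselves to a quick back-of-envelope calculation. The sketch below uses only the approximate per-GB prices quoted in this section. The function name, the decimal TB-to-GB conversion, and the sample traffic volumes are illustrative assumptions, not figures from a real bill.

```python
# Approximate AWS list prices quoted in the text (USD per GB).
EGRESS_PER_GB = 0.09             # data leaving AWS to the internet
CROSS_AZ_PER_GB_EACH_WAY = 0.01  # billed on both the sending and receiving side
CROSS_REGION_PER_GB = 0.02

GB_PER_TB = 1000  # decimal terabytes, good enough for interview math


def monthly_network_cost(egress_tb: float, cross_az_tb: float,
                         cross_region_tb: float) -> dict[str, float]:
    """Return a per-line-item estimate of monthly data transfer cost in USD."""
    costs = {
        "egress": egress_tb * GB_PER_TB * EGRESS_PER_GB,
        # Cross-AZ traffic is charged in each direction, so 2 x $0.01 per GB.
        "cross_az": cross_az_tb * GB_PER_TB * CROSS_AZ_PER_GB_EACH_WAY * 2,
        "cross_region": cross_region_tb * GB_PER_TB * CROSS_REGION_PER_GB,
    }
    costs["total"] = sum(costs.values())
    return costs


if __name__ == "__main__":
    # Hypothetical traffic profile for a mid-size service.
    estimate = monthly_network_cost(egress_tb=50, cross_az_tb=200, cross_region_tb=10)
    for item, usd in estimate.items():
        print(f"{item:>12}: ${usd:,.0f}")
```

Note that in this hypothetical profile the chatty cross-AZ traffic costs nearly as much as internet egress, which is why both lines are worth estimating separately.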

The cost estimation framework has three steps. First, identify the cost driver: for most systems, it is either storage (data-heavy systems like video platforms, data lakes) or compute (processing-heavy systems like real-time analytics, ML inference), though network egress can dominate for systems with high outbound bandwidth. Second, estimate the monthly cost using known pricing: if you store 100 TB in S3 Standard at roughly $0.023/GB, that is about $2,300/month for storage plus request costs. If you run 10 c5.2xlarge instances 24/7 at roughly $0.34/hour each, that is about $2,500/month on-demand. Third, optimize: reserved instances save 30-60% for stable workloads, spot instances save 60-90% for fault-tolerant batch processing, S3 Intelligent-Tiering automatically moves objects to cheaper tiers, and right-sizing instances based on actual utilization can cut costs by around 30%.
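The three steps can be sketched as a few small functions. This is a minimal illustration, assuming a 730-hour month, decimal terabytes, and approximate on-demand list prices (the $0.34/hour c5.2xlarge rate is an assumed us-east-1 figure). The function names are hypothetical.

```python
HOURS_PER_MONTH = 730             # average hours in a month (8,760 / 12)
GB_PER_TB = 1000                  # decimal terabytes
S3_STANDARD_PER_GB_MONTH = 0.023  # approximate, first pricing tier
C5_2XLARGE_PER_HOUR = 0.34        # assumed approximate on-demand rate


def s3_storage_cost(stored_tb: float,
                    price_per_gb: float = S3_STANDARD_PER_GB_MONTH) -> float:
    """Monthly storage cost in USD, ignoring request charges."""
    return stored_tb * GB_PER_TB * price_per_gb


def ec2_on_demand_cost(instance_count: int, hourly_rate: float,
                       hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost in USD of instances running for the given hours."""
    return instance_count * hourly_rate * hours


def with_discount(monthly_cost: float, discount: float) -> float:
    """Apply a fractional discount, e.g. 0.3 for a 30% saving."""
    return monthly_cost * (1.0 - discount)


if __name__ == "__main__":
    # Step 2: estimate each category from unit prices.
    estimates = {
        "storage": s3_storage_cost(100),
        "compute": ec2_on_demand_cost(10, C5_2XLARGE_PER_HOUR),
    }
    # Step 1: the largest line item is the cost driver.
    driver = max(estimates, key=estimates.get)
    print(f"cost driver: {driver} (${estimates[driver]:,.0f}/month)")
    # Step 3: show the range a 30-60% reserved-instance discount would give.
    low = with_discount(estimates["compute"], 0.60)
    high = with_discount(estimates["compute"], 0.30)
    print(f"compute with reserved pricing: ${low:,.0f} to ${high:,.0f}/month")
```

The compute estimate lands near the rounded figure quoted above; in an interview, agreeing to within 10% is more than enough.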

The 'cost per request' metric is increasingly important for microservice architectures. If your order service costs $5,000/month and handles 10 million requests, its cost per request is $0.0005. If the recommendation service costs $15,000/month for 1 million requests, its cost per request is $0.015 -- 30x more expensive per request. This metric reveals which services are candidates for optimization and helps you make architectural decisions: should a feature be a synchronous call to an expensive ML service, or a pre-computed result stored in a cheap cache?
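As a worked version of the arithmetic above, the following sketch computes cost per request for the two services in the example and ranks them. The `Service` class and `rank_by_cost_per_request` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    monthly_cost_usd: float
    monthly_requests: int

    @property
    def cost_per_request(self) -> float:
        """Infrastructure spend divided by request volume, in USD."""
        return self.monthly_cost_usd / self.monthly_requests


def rank_by_cost_per_request(services: list[Service]) -> list[Service]:
    """Most expensive per request first: these are the optimization candidates."""
    return sorted(services, key=lambda s: s.cost_per_request, reverse=True)


if __name__ == "__main__":
    services = [
        Service("order", 5_000, 10_000_000),
        Service("recommendation", 15_000, 1_000_000),
    ]
    for svc in rank_by_cost_per_request(services):
        print(f"{svc.name:>15}: ${svc.cost_per_request:.4f} per request")
```

Ranking by this metric rather than by total spend is the point: a service with a modest monthly bill can still be the most expensive one per unit of work.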

Key Points
  1. Cloud costs divide into four categories: compute (EC2, Lambda, containers), storage (S3, EBS, database), network (egress, cross-AZ, cross-region), and managed service premiums. Knowing approximate unit prices for each lets you estimate costs during an interview.
  2. Network egress is the hidden cost trap. Data entering AWS is free, but data leaving costs approximately $0.09/GB. A CDN serving 1 PB/month of video incurs roughly $90,000 in egress charges alone. Always estimate egress when designing systems with high outbound bandwidth.
  3. The cost estimation framework: (1) identify the cost driver (storage or compute), (2) estimate monthly cost using known unit prices, (3) optimize with reserved instances, spot, tiered storage, or right-sizing. This three-step approach works for any system.
  4. Reserved instances save 30-60% over on-demand for predictable workloads. Spot instances save 60-90% for fault-tolerant batch jobs. Savings Plans offer commitment-based discounts with more flexibility. Know these options and when each applies.
  5. Cost per request is the key metric for microservices. It normalizes infrastructure spend across services with different traffic volumes and reveals which services are disproportionately expensive to operate, guiding optimization priorities.
  6. Managed services trade money for operational simplicity. RDS costs more than self-hosted PostgreSQL but eliminates patching, backups, and failover management. The right choice depends on team size and expertise -- a small team should pay the premium; a large platform team may save millions by self-hosting.
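The managed-versus-self-hosted point can be made concrete with a rough total-cost-of-ownership comparison. The sketch below is a simplified model under loudly stated assumptions: the 40% premium sits inside the 30-50% range quoted earlier, while the operations hours and the $100/hour loaded engineering rate are hypothetical placeholders to be replaced with real numbers.

```python
def managed_monthly_cost(self_hosted_infra_usd: float, premium: float) -> float:
    """Managed service bill: raw infrastructure cost plus a fractional premium."""
    return self_hosted_infra_usd * (1.0 + premium)


def self_hosted_monthly_cost(infra_usd: float, ops_hours: float,
                             hourly_rate_usd: float) -> float:
    """Self-hosted bill: raw infrastructure plus the engineering time to run it."""
    return infra_usd + ops_hours * hourly_rate_usd


def cheaper_option(infra_usd: float, premium: float, ops_hours: float,
                   hourly_rate_usd: float) -> str:
    managed = managed_monthly_cost(infra_usd, premium)
    self_hosted = self_hosted_monthly_cost(infra_usd, ops_hours, hourly_rate_usd)
    return "managed" if managed <= self_hosted else "self-hosted"


if __name__ == "__main__":
    # Hypothetical small team: $2,000/month of hardware, 40 ops hours/month.
    print("small team:", cheaper_option(2_000, 0.40, 40, 100))
    # Hypothetical large platform: $500,000/month of hardware, 800 ops hours/month.
    print("large platform:", cheaper_option(500_000, 0.40, 800, 100))
```

The premium scales with the infrastructure bill while operations effort grows far more slowly, which is why the answer flips as scale increases.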
Simple Example

The Electricity Bill Analogy

Cloud cost management is like managing your home electricity bill. You pay for what you use (on-demand pricing), you can pre-pay for a year at a discount (reserved instances), and you can reduce consumption with efficiency measures (right-sizing, turning off unused resources). The biggest surprise on the bill is usually something you did not expect -- like an old space heater running 24/7 in the garage (an unmonitored service running oversized instances). The first step to reducing the bill is not switching providers -- it is understanding where the money goes. Similarly, the first step to cloud cost optimization is identifying the cost driver, not switching cloud providers.

Real-World Examples

Airbnb

Airbnb moved significant workloads from DynamoDB to a self-hosted database solution. DynamoDB's on-demand pricing was convenient during rapid growth, but at Airbnb's scale, the per-request cost exceeded what they would pay for dedicated infrastructure. By investing in a platform team to manage self-hosted databases, they reduced their database costs by over 50%. The lesson: managed services are cost-effective at moderate scale but can become the dominant cost driver at hyperscale.

Dropbox

Dropbox's Project Magic Pocket moved the majority of user data from Amazon S3 to custom-built storage infrastructure in data centers that Dropbox operates itself. At their scale (hundreds of petabytes), the bill for a general-purpose storage service was large enough to justify building a purpose-built one. The project required significant upfront investment, but Dropbox's IPO filing later attributed roughly $75 million in reduced infrastructure costs over two years to it. This is the most dramatic public example of the 'build vs buy' cost trade-off in cloud infrastructure.

Discord

Discord migrated from Cassandra to ScyllaDB for their message storage system. Cassandra's garbage collection pauses required over-provisioned nodes with large JVM heaps, driving up compute costs. ScyllaDB (a Cassandra-compatible database written in C++) achieved the same throughput on fewer nodes by eliminating GC overhead. Discord reported going from 177 Cassandra nodes to 72 ScyllaDB nodes while improving tail latency. This illustrates how technology choice directly impacts infrastructure cost.

Trade-Offs
Aspect | Description
--- | ---
Managed Service Premium vs Operational Cost | Managed services (RDS, DynamoDB, Aurora) cost 30-100% more than self-hosted equivalents but eliminate operational burden. For a startup with 2 engineers, the operational savings justify the premium. For a platform team of 50 at a large company, self-hosting can save millions annually. The break-even point depends on team size, expertise, and scale.
Reserved Instances vs Flexibility | Reserved instances save 30-60% but lock you into specific instance types and regions for a 1-year or 3-year term. If your workload changes (migration to containers, traffic shift to a different region), you pay for unused reservations. Savings Plans offer more flexibility but still require a commitment. The trade-off is cost savings versus ability to adapt.
Multi-Region vs Cost | Multi-region deployment improves latency and availability but doubles or triples infrastructure cost. Cross-region data replication adds network transfer charges. A system that costs $10,000/month in a single region might cost $25,000/month across three regions. The decision depends on whether the latency improvement and availability gain justify the cost increase.
Performance Optimization vs Infrastructure Cost | Caching reduces database load and improves latency but adds the cost of cache infrastructure (Redis/Memcached instances). CDNs reduce origin bandwidth but charge per request and per GB served. The optimization is only cost-effective if the saved infrastructure cost exceeds the cost of the optimization layer itself.
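The reserved-instance trade-off above has a simple break-even point. The sketch below assumes a deliberately simplified model: a commitment is billed for every hour at the discounted rate, while on-demand is billed only for the hours actually used. Function names are hypothetical, and real pricing (upfront payment options, Savings Plans, partial-hour billing) is more nuanced.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of hours a workload must run for a commitment to beat on-demand."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def compare(on_demand_full_month_usd: float, discount: float,
            utilization: float) -> dict[str, float]:
    """Monthly cost under each option for a workload running `utilization` of the time."""
    return {
        # On-demand: pay full rate, but only for the hours used.
        "on_demand": on_demand_full_month_usd * utilization,
        # Reserved: pay the discounted rate for every hour, used or not.
        "reserved": on_demand_full_month_usd * (1.0 - discount),
    }


if __name__ == "__main__":
    discount = 0.40  # within the 30-60% range quoted in the text
    print(f"break-even utilization: {break_even_utilization(discount):.0%}")
    for utilization in (0.5, 0.9):
        costs = compare(2_482, discount, utilization)
        winner = min(costs, key=costs.get)
        print(f"utilization {utilization:.0%}: {winner} is cheaper")
```

Under this model a 40% discount only pays off for workloads that run more than 60% of the time, which is why commitments suit steady baseline load and not bursty or short-lived workloads.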
Case Study

Optimizing a Video Streaming Platform's Cloud Costs

Scenario

A mid-size video streaming platform was spending $500,000/month on AWS infrastructure. The engineering team was asked to reduce costs by 40% without degrading user experience. The cost breakdown was: 45% compute (transcoding and serving), 30% storage (S3 for video files), 15% network (CDN egress), and 10% managed services (RDS, ElastiCache). The platform served 5 million daily active users with 2 PB of stored video content.

Solution

The team applied the cost estimation framework to each category. For compute (45%, $225K/month): they moved transcoding to spot instances (saving 70% on batch processing) and right-sized serving instances based on utilization data (25% were over-provisioned). For storage (30%, $150K/month): they implemented S3 Intelligent-Tiering, which automatically moved videos not accessed for 30 days to its Infrequent Access tier (40% cheaper) and videos not accessed for 90 days to its Archive Instant Access tier (68% cheaper). For network (15%, $75K/month): they negotiated a CloudFront committed use discount and implemented more aggressive edge caching to reduce origin fetches. For managed services (10%, $50K/month): they moved from RDS Multi-AZ to Aurora Serverless v2 for the metadata database, which scaled to zero during off-peak hours.

Outcome

Monthly costs dropped from $500,000 to $285,000, a 43% reduction. The largest savings came from spot instances for transcoding ($90K saved), S3 tiering ($60K saved), and instance right-sizing ($35K saved). Critically, none of these optimizations affected user-facing performance -- video startup time and rebuffering rates remained unchanged. The project demonstrated that cost optimization is primarily an engineering problem, not a business constraint, and that understanding cloud pricing models is a core engineering skill.
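The numbers in this case study can be checked with a few lines of arithmetic. The sketch below uses only figures stated in the scenario and outcome. The variable and function names are illustrative, and the "other" savings line is simply the remainder not attributed to a named optimization.

```python
MONTHLY_SPEND_USD = 500_000
NEW_MONTHLY_SPEND_USD = 285_000

# Cost breakdown from the scenario, in whole percent.
BREAKDOWN_PCT = {"compute": 45, "storage": 30, "network": 15, "managed_services": 10}

# Savings explicitly named in the outcome.
NAMED_SAVINGS_USD = {"spot_transcoding": 90_000, "s3_tiering": 60_000, "right_sizing": 35_000}


def category_spend(total_usd: int, breakdown_pct: dict[str, int]) -> dict[str, int]:
    """Convert a percentage breakdown into dollar amounts (integer math)."""
    return {name: total_usd * pct // 100 for name, pct in breakdown_pct.items()}


def reduction_pct(before_usd: int, after_usd: int) -> float:
    """Percentage reduction from before to after."""
    return (before_usd - after_usd) / before_usd * 100


if __name__ == "__main__":
    for name, usd in category_spend(MONTHLY_SPEND_USD, BREAKDOWN_PCT).items():
        print(f"{name:>17}: ${usd:,}")
    total_saved = MONTHLY_SPEND_USD - NEW_MONTHLY_SPEND_USD
    other = total_saved - sum(NAMED_SAVINGS_USD.values())
    print(f"total saved: ${total_saved:,} "
          f"({reduction_pct(MONTHLY_SPEND_USD, NEW_MONTHLY_SPEND_USD):.0f}%)")
    print(f"savings not attributed to a named optimization: ${other:,}")
```

The three named optimizations account for most of the reduction; the remainder would come from the network and managed-service changes described in the solution.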

Common Mistakes
  • Ignoring network egress costs. Data transfer out of AWS is approximately $0.09/GB, which is invisible until the bill arrives. A system serving 100 TB/month of video incurs $9,000 in egress alone. Always estimate egress for systems with high outbound data.
  • Comparing only the sticker price of managed services. RDS costs more per hour than a self-hosted EC2 database, but the comparison must include the engineering time for patching, backups, monitoring, and failover. For most teams, managed services are cheaper when total cost of ownership is considered.
  • Not knowing approximate cloud pricing. In a 2026 interview, saying 'I do not know what S3 costs' is like saying 'I do not know what a hash map is' in a coding interview. Memorize the order of magnitude: S3 Standard is roughly $0.023/GB/month, EC2 m5.xlarge is roughly $0.19/hour, and Lambda is $0.20 per million invocations plus a duration charge per GB-second.
  • Optimizing the wrong cost category. If 60% of your bill is compute, spending weeks optimizing a storage strategy that is 10% of the bill yields minimal return. Always identify the cost driver first and optimize the largest category.
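The last mistake in the list comes down to simple proportions: the overall saving from optimizing one category is that category's share of the bill multiplied by the reduction achieved within it. A minimal sketch, using a hypothetical $100,000/month bill in which compute is 60% and storage is 10%:

```python
def largest_cost_driver(bill_usd: dict[str, float]) -> str:
    """Name of the category with the highest monthly spend."""
    return max(bill_usd, key=bill_usd.get)


def overall_reduction(bill_usd: dict[str, float], category: str,
                      category_reduction: float) -> float:
    """Fraction of the total bill saved by cutting one category by `category_reduction`."""
    total = sum(bill_usd.values())
    return bill_usd[category] * category_reduction / total


if __name__ == "__main__":
    # Hypothetical monthly bill in USD.
    bill = {"compute": 60_000, "storage": 10_000, "network": 20_000, "managed": 10_000}
    print("cost driver:", largest_cost_driver(bill))
    print(f"30% cut in compute saves {overall_reduction(bill, 'compute', 0.30):.0%} overall")
    print(f"50% cut in storage saves {overall_reduction(bill, 'storage', 0.50):.0%} overall")
```

Even an aggressive 50% cut in the small category saves far less than a moderate 30% cut in the dominant one.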
Test Your Understanding

1. Which of the following is typically the most surprising hidden cost in cloud infrastructure?

2. What is the 'cost per request' metric and why is it useful for microservices?

3. Why did Dropbox move from S3 to custom-built storage (Project Magic Pocket)?
