
Build vs Buy

The build vs buy decision determines whether to develop a system component in-house or adopt a third-party solution (SaaS, open-source, managed service). This trade-off affects engineering velocity, operational costs, competitive differentiation, and long-term flexibility. It is among the most impactful architectural decisions and appears frequently in staff-level system design interviews.

Overview

The build vs buy decision is deceptively complex. Engineers often default to building because custom solutions feel more powerful, more flexible, and more interesting. But building carries costs that are chronically underestimated: ongoing maintenance, on-call burden, feature parity with commercial solutions, security patching, documentation, and the opportunity cost of engineering time not spent on core product features. A home-built authentication system requires years of security hardening that Auth0 or Okta have already done. A custom monitoring system requires years of feature development that Datadog provides out of the box.

The core principle is: build what differentiates your business; buy everything else. Stripe builds its own payment processing infrastructure because that is their core product. But Stripe uses AWS (they do not build their own data centers), Bugsnag for error tracking, and PagerDuty for incident management. Google builds its own database (Spanner) because distributed data storage is core to their business. But even Google uses third-party tools for non-core functions. The question is not 'can we build it?' but 'should we spend our scarce engineering resources building it instead of working on our core product?'

Total Cost of Ownership (TCO) analysis over 3-5 years is the most rigorous way to evaluate build vs buy. Building costs include: initial development (engineer salaries for 3-12 months), ongoing maintenance (typically 20% of initial build cost per year), on-call rotation, infrastructure costs, security audits, feature development to keep pace with user needs, and opportunity cost of engineers not working on core features. Buying costs include: license fees, integration development, vendor management overhead, potential vendor lock-in switching costs, and limitations on customization. In most cases, the buy TCO is significantly lower for non-core components.
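The TCO comparison above can be sketched in a few lines of Python. This is a minimal illustration, not a complete cost model: the class names, the `other_yearly_costs` and `yearly_overhead` buckets, and every dollar figure are hypothetical, and only the 20% yearly maintenance rate comes from the rule of thumb in the text. Opportunity cost and switching costs are deliberately left out because they are hard to reduce to a single number.

```python
from dataclasses import dataclass


@dataclass
class BuildOption:
    initial_build_cost: float        # engineer salaries for the initial development
    maintenance_rate: float = 0.20   # yearly maintenance as a share of the initial build cost
    other_yearly_costs: float = 0.0  # hypothetical bucket: infrastructure, audits, on-call

    def tco(self, years: int) -> float:
        yearly = self.initial_build_cost * self.maintenance_rate + self.other_yearly_costs
        return self.initial_build_cost + yearly * years


@dataclass
class BuyOption:
    yearly_license_fee: float
    integration_cost: float          # one-time integration development
    yearly_overhead: float = 0.0     # hypothetical bucket: vendor management, integration upkeep

    def tco(self, years: int) -> float:
        return self.integration_cost + (self.yearly_license_fee + self.yearly_overhead) * years


# Hypothetical inputs chosen only to exercise the model.
build = BuildOption(initial_build_cost=300_000, other_yearly_costs=40_000)
buy = BuyOption(yearly_license_fee=60_000, integration_cost=20_000, yearly_overhead=10_000)

for years in (3, 5):
    print(f"{years}-year TCO: build=${build.tco(years):,.0f}, buy=${buy.tco(years):,.0f}")
```

Running the same model at both 3 and 5 years is the useful part: the one-time costs shrink in importance over time, while the recurring costs dominate.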

The third option -- adopting and operating open-source software -- is a middle ground that combines some advantages and disadvantages of both. Open-source gives you source code access (no vendor lock-in), community-driven feature development, and no license fees. But it requires operational expertise: you are responsible for deployment, upgrades, security patches, scaling, and troubleshooting. Running your own Kafka cluster or Elasticsearch cluster requires dedicated infrastructure expertise. Managed services (AWS MSK for Kafka, Elastic Cloud for Elasticsearch) shift the operational burden to the vendor at a higher monetary cost but lower engineering cost.

Key Points
  1. Build what differentiates; buy everything else. Your engineering team's time is your scarcest resource. Every hour spent building a custom logging system is an hour not spent on features that distinguish your product from competitors. Authentication, monitoring, email delivery, and payment processing are commodities -- buy them.
  2. Engineers systematically underestimate maintenance costs. The initial build is 20-30% of the total lifetime cost. Ongoing maintenance (bug fixes, security patches, feature additions, on-call, documentation) accounts for 70-80%. A custom solution that takes 6 months to build requires 1-2 dedicated engineers in perpetuity for maintenance.
  3. Vendor lock-in is real but often overestimated. Teams avoid buying because of lock-in fears, then spend 12 months building an inferior custom solution. Switching costs are real, but most mature SaaS products have export capabilities and standard APIs. The cost of building is certain and immediate; the cost of switching vendors is uncertain and future.
  4. Open-source is not free. The license is free, but operating, maintaining, upgrading, and troubleshooting open-source infrastructure requires significant engineering expertise. Running a production Kafka cluster or Kubernetes cluster is a full-time job. Managed services (AWS MSK, GKE) trade money for engineering time -- usually a good trade for non-core infrastructure.
  5. Evaluate the build option against the best available buy option, not against a straw man. Compare your proposed custom search engine against Elasticsearch or Algolia, not against a hypothetical bad vendor. If the best vendor solution covers 90% of your requirements, building for the remaining 10% is rarely justified.
  6. Revisit build vs buy decisions periodically. A component that made sense to build in 2020 (because no good vendor existed) may now have excellent SaaS options. Conversely, a bought solution that worked at small scale may not meet your needs at 100x scale, justifying a custom build.
Simple Example

Custom Auth System vs Auth0

A startup needs user authentication for their SaaS product.

Option A (Build): 2 engineers spend 3 months building a custom auth system with password hashing, JWT tokens, OAuth integration, MFA, rate limiting, account lockout, password reset flows, and session management. Initial cost: ~$150K in engineering time. Ongoing: 0.5 engineer for maintenance, security patches, and feature additions, or ~$100K/year. Three-year total: $150K + 3 x $100K = ~$450K.

Option B (Buy Auth0): 1 engineer spends 2 weeks integrating Auth0, budgeted at ~$30K. Auth0 fees at their scale: ~$30K/year. Integration maintenance: ~0.05 engineer, or ~$10K/year. Three-year total: $30K integration + $90K Auth0 fees + $30K maintenance = ~$150K.

The buy option costs about one third as much, AND those 2 engineers spend 3 months building product features instead of authentication. Auth0 also provides features the custom build would not: breached password detection, anomaly detection, and compliance certifications (SOC 2, HIPAA) -- features that would take years of additional development to build in-house.
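The arithmetic in this example can be checked with a short script. The figures are the ones stated in the example; the variable names are illustrative only.

```python
YEARS = 3

# Option A (build): figures as stated in the example
build_initial = 150_000                 # 2 engineers for 3 months
build_maintenance_per_year = 100_000    # 0.5 engineer
build_total = build_initial + build_maintenance_per_year * YEARS

# Option B (buy): figures as stated in the example
integration_one_time = 30_000           # integration line item in the 3-year breakdown
auth0_fee_per_year = 30_000
integration_upkeep_per_year = 10_000    # ~0.05 engineer
buy_total = integration_one_time + (auth0_fee_per_year + integration_upkeep_per_year) * YEARS

print(f"Build: ${build_total:,}")                   # Build: $450,000
print(f"Buy:   ${buy_total:,}")                     # Buy:   $150,000
print(f"Ratio: {build_total / buy_total:.1f}x")     # Ratio: 3.0x
```

Note that the comparison only counts direct costs. The three months of product work the two engineers would otherwise have shipped is an additional cost of the build option that does not appear in either total.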

Real-World Examples

Dropbox

Dropbox initially stored all files on Amazon S3 (buy). As they scaled toward exabytes, S3 costs became a significant expense, and they needed performance optimizations that S3 could not provide. They built their own storage system (Magic Pocket) and, by around the end of 2015, had moved roughly 90% of user data off S3. This massive build effort was justified because storage is Dropbox's core competency -- their entire product is file storage. Dropbox's 2018 IPO filing attributed roughly $75 million in reduced operating costs over the following two years to the infrastructure move. However, Dropbox continues to use AWS for non-core services: they buy their monitoring, CI/CD, and other commodity infrastructure.

Figma

Figma built a custom real-time collaborative editing engine (including a custom WebGL rendering engine and a CRDT-inspired multiplayer system coordinated by a central server) because real-time collaboration is their core differentiator. However, they buy rather than build the commodity pieces: Stripe for payments, Datadog for monitoring, LaunchDarkly for feature flags, and Auth0 for authentication. This is a textbook application of 'build the core, buy the rest.' The custom collaboration engine took years to build and is what makes Figma uniquely valuable; the purchased components are commodities that multiple vendors provide well.

Segment

Segment started by building their own message queue for customer data routing, but eventually migrated to Amazon Kinesis and later Kafka. Their initial custom queue was a source of constant operational pain -- message loss bugs, scaling issues, and on-call incidents. They realized that message queuing was not their core differentiator (customer data integration was), and that Kafka provided better reliability, scalability, and ecosystem integration than they could build with a small team. The migration freed engineers to focus on connectors and data transformations -- their actual product.

Trade-Offs
Aspect | Description
Customization vs Time-to-Market | Building gives you exactly what you need -- custom features, tight integration with your architecture, and full control over the roadmap. Buying gives you 80-90% of what you need in days or weeks instead of months. The gap between 'exactly what we want' and 'close enough' is rarely worth the 6-12 month delay. Ship with the bought solution now, and if the remaining 10-20% gap becomes a real problem at scale, consider building then -- with the benefit of understanding your actual requirements rather than guessed ones.
Control vs Operational Burden | Building gives you full control: you can fix bugs immediately, prioritize features you need, and avoid depending on a vendor's roadmap. But this control comes with operational burden: on-call for the component, security patching, performance tuning, capacity planning, and documentation. Buying outsources this operational burden to a vendor with a dedicated team. For non-core components, the vendor's team is almost certainly larger and more experienced at operating that specific system than yours.
Cost at Small Scale vs Cost at Large Scale | SaaS pricing typically favors small scale and penalizes large scale. Auth0 at $1K/month for a startup is a bargain; Auth0 at $500K/year for 50M users may justify a custom build. AWS costs at startup scale are negligible; at Dropbox scale (exabytes of storage), the savings from custom infrastructure are large enough to justify a multi-year engineering effort. Evaluate costs at your current scale AND your projected 3-year scale. Most companies never reach the scale where building is cheaper.
Vendor Lock-in vs Build Lock-in | Vendor lock-in is the fear of being dependent on a vendor's pricing, features, and stability. Build lock-in is the less-discussed counterpart: once you build a custom component, you are locked into maintaining it, recruiting for it, and investing in it indefinitely. You cannot cancel a subscription to a custom system. A vendor can be replaced (at a cost); a custom system must be perpetually staffed. Often, build lock-in is more expensive than vendor lock-in.
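The small-scale vs large-scale trade-off comes down to a break-even point, which can be estimated with a rough sketch like the one below. Everything here is an assumption made for illustration: the linear per-user pricing (real SaaS pricing is usually tiered and negotiated), the build cost, the amortization period, and the maintenance figure. The price of $10 per 1,000 users per year was picked only because it reproduces the $500K/year at 50M users mentioned above.

```python
import math


def saas_yearly_cost(users: int, price_per_1k_users: float) -> float:
    """Yearly SaaS bill under a simplified linear pricing model."""
    return users / 1_000 * price_per_1k_users


def build_yearly_cost(initial_build_cost: float, amortization_years: int,
                      yearly_maintenance: float) -> float:
    """Yearly cost of a custom build: amortized initial cost plus maintenance."""
    return initial_build_cost / amortization_years + yearly_maintenance


def break_even_users(price_per_1k_users: float, build_cost_per_year: float) -> int:
    """User count (rounded up to the next 1,000) at which the SaaS bill reaches the build cost."""
    return math.ceil(build_cost_per_year / price_per_1k_users) * 1_000


PRICE_PER_1K_USERS = 10.0  # hypothetical: $10 per 1,000 users per year
build_per_year = build_yearly_cost(
    initial_build_cost=600_000,   # hypothetical
    amortization_years=3,         # hypothetical
    yearly_maintenance=250_000,   # hypothetical
)

for users in (100_000, 5_000_000, 50_000_000):
    buy = saas_yearly_cost(users, PRICE_PER_1K_USERS)
    cheaper = "buy" if buy < build_per_year else "build"
    print(f"{users:>11,} users: buy=${buy:,.0f}/yr, build=${build_per_year:,.0f}/yr -> {cheaper}")

threshold = break_even_users(PRICE_PER_1K_USERS, build_per_year)
print(f"Break-even at about {threshold:,} users")
```

Under these made-up inputs the break-even sits at tens of millions of users, which is consistent with the point that most companies never reach the scale where building is cheaper.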
Case Study

Slack's Build-to-Buy Journey with Search

Scenario

Slack initially built their own search infrastructure to provide full-text message search across workspaces. As the product grew to millions of daily active users with billions of messages, the custom search system became increasingly difficult to scale and maintain. Search relevance, ranking, and performance required dedicated engineering expertise that competed with Slack's core product development priorities. The search team was perpetually understaffed relative to the problem's complexity, and search quality fell behind user expectations.

Solution

Slack evaluated their options: (1) invest heavily in the custom search system with a larger dedicated team, (2) migrate to a managed Elasticsearch service, or (3) adopt a commercial search solution. They chose to migrate to a managed Elasticsearch-based architecture, leveraging the mature full-text search capabilities, faceted search, relevance tuning, and horizontal scaling that Elasticsearch provides. The migration preserved their custom relevance models and security requirements while offloading the operational burden of running the search cluster to a managed service.

Outcome

The migration freed 8 engineers who had been maintaining the custom search infrastructure. Search quality improved because Elasticsearch's mature BM25 ranking, analyzers, and aggregation features exceeded what the small custom team had built. Search latency P99 improved from 2 seconds to 200ms. Operational incidents related to search dropped by 80%. The freed engineers were reassigned to Slack's core product features: channels, workflows, and the platform API -- components that directly differentiate Slack from competitors. This case illustrates that even when you have the capability to build, buying is often the right choice for non-core components.

Common Mistakes
  • Building because 'we can' rather than 'we should.' Engineering teams enjoy building systems, and 'Not Invented Here' syndrome is real. The question is not whether you have the talent to build a custom monitoring system -- you probably do. The question is whether your monitoring system will be better than Datadog after 3 years of part-time maintenance, and whether those engineering hours are better spent on your core product.
  • Comparing build cost to buy cost without including maintenance. Suppose the initial build takes 6 months and costs $300K in engineering time. Teams compare this against $100K/year for a SaaS license and conclude that building breaks even at 3 years and is cheaper afterwards. But they forget the ongoing 0.5-1.0 FTE maintenance cost ($100-200K/year), on-call burden, and opportunity cost. Maintenance alone brings the 3-year figure to $600-900K; with on-call and opportunity cost included, the true 3-year TCO for building is often $800K-1M+, not $300K.
  • Over-weighting vendor lock-in risk. Teams spend 12 months building a custom solution to avoid vendor lock-in, then use that custom solution for 5+ years without ever needing to switch. The lock-in cost they avoided was theoretical; the build cost they incurred was real. Evaluate the actual probability and cost of needing to switch vendors, not the theoretical maximum.
  • Failing to revisit build vs buy as the landscape evolves. A custom deployment system built in 2018 (before mature managed Kubernetes) may now be inferior to EKS/GKE. A SaaS solution adopted at 100 users may be prohibitively expensive at 10M users. Schedule annual reviews of major build vs buy decisions, especially when your scale changes by 10x or when new solutions enter the market.
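The review rule in the last item (annually, or whenever scale changes by 10x) is simple enough to encode as a check. The sketch below is one possible reading of that rule; the function name, parameters, and defaults are hypothetical, and the "new solutions enter the market" trigger is not modeled because it needs human judgment.

```python
from datetime import date


def needs_review(last_review: date, today: date,
                 scale_at_last_review: int, current_scale: int,
                 max_age_days: int = 365, scale_factor: int = 10) -> bool:
    """Return True if a build vs buy decision is due for another look.

    Assumes scale_at_last_review and current_scale are positive counts
    (users, requests per day, stored bytes, or any other consistent unit).
    """
    overdue = (today - last_review).days >= max_age_days
    grew = current_scale >= scale_at_last_review * scale_factor
    shrank = current_scale * scale_factor <= scale_at_last_review
    return overdue or grew or shrank


# A SaaS tool adopted at 100 users, now serving 10M users, five months after the last review.
print(needs_review(date(2024, 1, 1), date(2024, 6, 1), 100, 10_000_000))  # True
# Same timing, but scale has only grown 5x.
print(needs_review(date(2024, 1, 1), date(2024, 6, 1), 100, 500))         # False
```

A check like this could run against a simple inventory of major components, so that reviews are triggered by data rather than by whoever happens to remember.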
Test Your Understanding

1. A 20-person startup needs to add monitoring and alerting to their production system. Which approach is most appropriate?

2. Dropbox migrated from S3 to a custom storage system (Magic Pocket). What justified this build decision?
