The build vs buy decision determines whether to develop a system component in-house or adopt a third-party solution (SaaS, open-source, managed service). This trade-off affects engineering velocity, operational costs, competitive differentiation, and long-term flexibility. It is among the most impactful architectural decisions and appears frequently in staff-level system design interviews.
The build vs buy decision is deceptively complex. Engineers often default to building because custom solutions feel more powerful, more flexible, and more interesting. But building carries costs that are chronically underestimated: ongoing maintenance, on-call burden, feature parity with commercial solutions, security patching, documentation, and the opportunity cost of engineering time not spent on core product features. A home-built authentication system requires years of security hardening that Auth0 or Okta have already done. A custom monitoring system requires years of feature development that Datadog provides out of the box.
The core principle is: build what differentiates your business; buy everything else. Stripe builds its own payment processing infrastructure because that is their core product. But Stripe uses AWS (they do not build their own data centers), Bugsnag for error tracking, and PagerDuty for incident management. Google builds its own database (Spanner) because distributed data storage is core to their business. But even Google uses third-party tools for non-core functions. The question is not 'can we build it?' but 'should we spend our scarce engineering resources building it instead of working on our core product?'
Total Cost of Ownership (TCO) analysis over 3-5 years is the most rigorous way to evaluate build vs buy. Building costs include: initial development (engineer salaries for 3-12 months), ongoing maintenance (typically 20% of initial build cost per year), on-call rotation, infrastructure costs, security audits, feature development to keep pace with user needs, and opportunity cost of engineers not working on core features. Buying costs include: license fees, integration development, vendor management overhead, potential vendor lock-in switching costs, and limitations on customization. In most cases, the buy TCO is significantly lower for non-core components.
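The comparison above can be sketched as a small calculator. This is a minimal illustration, not a complete cost model: the class names, field names, and sample dollar figures are hypothetical, and the 20% maintenance rate is simply the rule of thumb quoted above.

```python
from dataclasses import dataclass


@dataclass
class BuildOption:
    """Hypothetical cost model for building in-house."""
    initial_cost: float             # up-front engineering cost
    maintenance_rate: float = 0.20  # yearly maintenance as a fraction of initial cost
    annual_other: float = 0.0       # on-call, infrastructure, audits, opportunity cost

    def tco(self, years: int) -> float:
        maintenance = self.initial_cost * self.maintenance_rate * years
        return self.initial_cost + maintenance + self.annual_other * years


@dataclass
class BuyOption:
    """Hypothetical cost model for adopting a vendor."""
    integration_cost: float       # one-time integration work
    annual_license: float         # license or subscription fees per year
    annual_overhead: float = 0.0  # vendor management, integration upkeep

    def tco(self, years: int) -> float:
        return self.integration_cost + (self.annual_license + self.annual_overhead) * years


# Illustrative numbers only; substitute your own estimates.
build = BuildOption(initial_cost=200_000, annual_other=50_000)
buy = BuyOption(integration_cost=20_000, annual_license=40_000, annual_overhead=5_000)

for years in (3, 5):
    print(f"{years} years: build ${build.tco(years):,.0f} vs buy ${buy.tco(years):,.0f}")
```

Switching costs and customization limits are deliberately left out of `BuyOption` because they are hard to price; in practice they are often handled as a qualitative adjustment on top of the numbers.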
The third option, adopting and operating open-source software, is a middle ground with some of the advantages and disadvantages of both. Open source gives you source code access (which reduces vendor lock-in), community-driven feature development, and no license fees. But it requires operational expertise: you are responsible for deployment, upgrades, security patches, scaling, and troubleshooting. Running your own Kafka or Elasticsearch cluster requires dedicated infrastructure expertise. Managed services (Amazon MSK for Kafka, Elastic Cloud for Elasticsearch) shift the operational burden to the vendor: you pay more in money but less in engineering time.
Custom Auth System vs Auth0
A startup needs user authentication for their SaaS product.
Option A (Build): 2 engineers spend 3 months building a custom auth system with password hashing, JWTs, OAuth integration, MFA, rate limiting, account lockout, password reset flows, and session management. Initial cost: ~$150K in engineering time. Ongoing: 0.5 engineer for maintenance, security patches, and feature additions, or ~$100K/year. After 3 years: ~$450K total ($150K initial + $300K ongoing).
Option B (Buy Auth0): 1 engineer spends 2 weeks integrating Auth0. Auth0 cost at their scale: ~$30K/year. Integration maintenance: ~0.05 engineer, or ~$10K/year. After 3 years: ~$150K total ($90K Auth0 fees + $30K integration + $30K maintenance).
The buy option costs about one-third as much, and the 2 engineers spend those 3 months building product features instead of authentication. Auth0 also provides features the custom build would not: breached password detection, anomaly detection, and compliance certifications (SOC 2, HIPAA). Matching these in-house would take years of additional development.
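The arithmetic behind the two totals can be checked in a few lines. This sketch uses only the dollar figures stated in the example; the variable names are illustrative.

```python
YEARS = 3

# Option A (build): figures as stated in the example
build_initial = 150_000  # 2 engineers for 3 months
build_annual = 100_000   # 0.5 engineer of ongoing maintenance
build_total = build_initial + build_annual * YEARS

# Option B (buy Auth0): figures as stated in the example
buy_integration = 30_000   # one-time integration figure used in the example's total
buy_license = 30_000       # Auth0 fees per year
buy_maintenance = 10_000   # ~0.05 engineer per year
buy_total = buy_integration + (buy_license + buy_maintenance) * YEARS

ratio = build_total / buy_total

print(f"Build: ${build_total:,}")   # Build: $450,000
print(f"Buy:   ${buy_total:,}")     # Buy:   $150,000
print(f"Build costs {ratio:.1f}x as much as buy")
```

Changing `YEARS` shows how the gap widens over time: the build option's recurring cost is more than twice the buy option's, so a longer horizon favors buying even more under these assumptions.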
Dropbox
Dropbox initially stored all files on Amazon S3 (buy). As they scaled toward exabytes, S3 costs became a significant expense, and they needed performance optimizations that S3 could not provide. They built their own storage system (Magic Pocket) and by late 2015 had migrated roughly 90% of user data off S3. This massive build effort was justified because storage is Dropbox's core competency: their entire product is file storage. Dropbox's S-1 filing reported about $75 million in reduced operating costs over the two years following the migration. However, Dropbox continues to use AWS for non-core services: they buy their monitoring, CI/CD, and other commodity infrastructure.
Figma
Figma built a custom real-time collaborative editing engine (including a custom WebGL rendering engine and a CRDT-inspired multiplayer system coordinated by a central server) because real-time collaboration is their core differentiator. However, they buy Stripe for payments, Datadog for monitoring, LaunchDarkly for feature flags, and Auth0 for authentication. This is a textbook application of 'build the core, buy the rest.' The custom collaboration engine took years to build and is what makes Figma uniquely valuable; the purchased components are commodities that multiple vendors provide well.
Segment
Segment started by building their own message queue for customer data routing, but eventually migrated to Amazon Kinesis and later Kafka. Their initial custom queue was a source of constant operational pain -- message loss bugs, scaling issues, and on-call incidents. They realized that message queuing was not their core differentiator (customer data integration was), and that Kafka provided better reliability, scalability, and ecosystem integration than they could build with a small team. The migration freed engineers to focus on connectors and data transformations -- their actual product.
| Aspect | Description |
|---|---|
| Customization vs Time-to-Market | Building gives you exactly what you need -- custom features, tight integration with your architecture, and full control over the roadmap. Buying gives you 80-90% of what you need in days or weeks instead of months. The gap between 'exactly what we want' and 'close enough' is rarely worth the 6-12 month delay. Ship with the bought solution now, and if the 10% gap becomes a real problem at scale, consider building then -- with the benefit of understanding your actual requirements rather than guessed ones. |
| Control vs Operational Burden | Building gives you full control: you can fix bugs immediately, prioritize features you need, and avoid depending on a vendor's roadmap. But this control comes with operational burden: on-call for the component, security patching, performance tuning, capacity planning, and documentation. Buying outsources this operational burden to a vendor with a dedicated team. For non-core components, the vendor's team is almost certainly larger and more experienced at operating that specific system than yours. |
| Cost at Small Scale vs Cost at Large Scale | SaaS pricing typically favors small scale and penalizes large scale. Auth0 at $1K/month for a startup is a bargain; Auth0 at $500K/year for 50M users may justify a custom build. AWS costs at startup scale are negligible; at Dropbox scale (exabytes of storage), building custom infrastructure cut costs by tens of millions of dollars per year. Evaluate costs at your current scale AND your projected 3-year scale. Most companies never reach the scale where building is cheaper. |
| Vendor Lock-in vs Build Lock-in | Vendor lock-in is the fear of being dependent on a vendor's pricing, features, and stability. Build lock-in is the less-discussed counterpart: once you build a custom component, you are locked into maintaining it, recruiting for it, and investing in it indefinitely. You cannot cancel a subscription to a custom system. A vendor can be replaced (at a cost); a custom system must be perpetually staffed. Often, build lock-in is more expensive than vendor lock-in. |
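The scale trade-off in the table can be made concrete with a break-even estimate. The sketch below assumes strictly linear per-user pricing, which is a simplification (real SaaS pricing is usually tiered and negotiated). It derives a hypothetical rate from the table's $500K/year for 50M users and takes the annualized build cost from the auth example's $450K over 3 years; the function names are illustrative.

```python
def annual_saas_cost(users: int, price_per_1000_users: float) -> float:
    """Yearly vendor cost under an assumed linear per-user price."""
    return users / 1000 * price_per_1000_users


def break_even_users(build_annual_cost: float, price_per_1000_users: float) -> float:
    """User count at which the vendor bill equals the yearly cost of a custom build."""
    if price_per_1000_users <= 0:
        raise ValueError("price_per_1000_users must be positive")
    return build_annual_cost / price_per_1000_users * 1000


# Assumed rate: $500K/year for 50M users works out to $10 per 1,000 users per year.
PRICE_PER_1000 = 500_000 / (50_000_000 / 1000)

# Assumed build cost: $450K over 3 years, annualized.
BUILD_ANNUAL = 450_000 / 3

threshold = break_even_users(BUILD_ANNUAL, PRICE_PER_1000)
print(f"Break-even at about {threshold:,.0f} users")

for users in (100_000, 5_000_000, 50_000_000):
    saas = annual_saas_cost(users, PRICE_PER_1000)
    cheaper = "buy" if saas < BUILD_ANNUAL else "build"
    print(f"{users:>11,} users: SaaS ${saas:,.0f}/yr vs build ${BUILD_ANNUAL:,.0f}/yr -> {cheaper}")
```

Under these assumptions the crossover sits in the tens of millions of users, which is consistent with the table's point that most companies never reach the scale where building is cheaper.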
Slack's Build-to-Buy Journey with Search
Scenario
Slack initially built their own search infrastructure to provide full-text message search across workspaces. As the product grew to millions of daily active users with billions of messages, the custom search system became increasingly difficult to scale and maintain. Search relevance, ranking, and performance required dedicated engineering expertise that competed with Slack's core product development priorities. The search team was perpetually understaffed relative to the problem's complexity, and search quality fell behind user expectations.
Solution
Slack evaluated their options: (1) invest heavily in the custom search system with a larger dedicated team, (2) migrate to a managed Elasticsearch service, or (3) adopt a commercial search solution. They chose to migrate to a managed Elasticsearch-based architecture, leveraging the mature full-text search capabilities, faceted search, relevance tuning, and horizontal scaling that Elasticsearch provides. The migration preserved their custom relevance models and security requirements while offloading the operational burden of running the search cluster to a managed service.
Outcome
The migration freed 8 engineers who had been maintaining the custom search infrastructure. Search quality improved because Elasticsearch's mature BM25 ranking, analyzers, and aggregation features exceeded what the small custom team had built. Search latency P99 improved from 2 seconds to 200ms. Operational incidents related to search dropped by 80%. The freed engineers were reassigned to Slack's core product features: channels, workflows, and the platform API -- components that directly differentiate Slack from competitors. This case illustrates that even when you have the capability to build, buying is often the right choice for non-core components.