Vetora logo
⚠️Foundations

Fallacies of Distributed Computing

The eight fallacies of distributed computing are false assumptions that developers make when building networked systems. First articulated by Peter Deutsch at Sun Microsystems in 1994, these fallacies remain relevant in cloud-native architecture and explain why distributed systems are fundamentally harder than single-machine programs.

Overview

The fallacies of distributed computing are eight false assumptions that developers implicitly make when building systems that communicate over a network. Peter Deutsch identified the first seven at Sun Microsystems in 1994, and James Gosling added the eighth. Despite being over 30 years old, these fallacies remain the most common source of distributed system failures because the cloud abstracts the network just enough to make developers forget it exists -- until it breaks.

The first fallacy -- the network is reliable -- is the most impactful. Every network call can fail: packets are dropped, connections time out, DNS fails to resolve, and entire availability zones go offline. In 2017, a single S3 outage in us-east-1 cascaded across hundreds of services because they assumed S3 was always reachable. The second fallacy -- latency is zero -- causes systems designed for local function calls to perform catastrophically when those calls cross a network. A loop that makes 100 sequential database calls takes microseconds locally but seconds over a network. The third fallacy -- bandwidth is infinite -- leads to chatty protocols that transfer entire objects when only a few fields are needed, collapsing under load when the network saturates.

The remaining fallacies are equally dangerous. The network is secure (fallacy 4) leads to unencrypted internal traffic that is exploitable after any perimeter breach. Topology doesn't change (fallacy 5) causes hard-coded IP addresses and brittle service configurations that break during deployments, failovers, or cloud instance replacements. There is one administrator (fallacy 6) ignores that modern systems span multiple teams, cloud providers, and even organizations, each with different policies, maintenance windows, and security requirements. Transport cost is zero (fallacy 7) ignores serialization overhead, TLS handshake costs, and cloud data-transfer charges that can dominate infrastructure budgets. The network is homogeneous (fallacy 8) assumes uniform network equipment and protocols, when real systems traverse WiFi, cellular, corporate firewalls, CDN edges, and submarine cables.

These fallacies remain critically relevant in 2026 because cloud computing creates a dangerous illusion of reliability. Managed services, auto-scaling, and multi-AZ deployments hide the network's complexity but do not eliminate it. A developer calling an AWS Lambda function from another Lambda function is still making a network call that can time out, be throttled, or hit a cold start. Understanding these fallacies is the first step toward building resilient systems that handle failure as a normal operating condition rather than an exceptional surprise.

Key Points
  • 1Fallacy 1: The network is reliable. Packets are dropped, connections time out, and entire regions go offline. Every remote call needs retry logic, circuit breakers, and graceful degradation paths.
  • 2Fallacy 2: Latency is zero. A local function call takes nanoseconds; a network call takes milliseconds. This 1,000,000x difference means distributed systems must minimize round trips, batch requests, and cache aggressively.
  • 3Fallacy 3: Bandwidth is infinite. Transferring 1MB payloads per API call works in development but saturates network links at production scale. Design APIs that transfer only the data needed (pagination, field selection, compression).
  • 4Fallacy 4: The network is secure. Internal networks are not inherently safe. Zero-trust architecture assumes every network call could be intercepted and requires authentication, encryption (mTLS), and authorization at every service boundary.
  • 5Fallacy 5: Topology doesn't change. Cloud instances are ephemeral. IP addresses change with every deployment. Service discovery (DNS, Consul, Kubernetes Services) must replace hard-coded endpoints.
  • 6Fallacy 6: There is one administrator. Modern systems span teams, cloud accounts, and third-party providers. Each has different maintenance windows, rate limits, and failure modes that affect your system's availability.
Simple Example

The Office Network Analogy

Imagine you are passing notes to a colleague across the office. You assume: the note always arrives (reliable), it arrives instantly (zero latency), you can send as many notes as you want (infinite bandwidth), nobody reads it in transit (secure), your colleague is always at the same desk (topology stable), and only you control the note-passing system (one admin). Now imagine the office expands to 10 buildings across 3 cities. Notes get lost in transit, take hours to arrive, the mail room has capacity limits, anyone in the mail chain can read them, people move desks daily, and different buildings have different mail policies. Distributed computing is the 10-building version, but developers keep designing for the single-office case.

Real-World Examples

AWS (us-east-1 S3 Outage, 2017)

A typo in a maintenance command took down a large subset of S3 in us-east-1 for four hours. Hundreds of services -- including parts of the AWS console itself -- failed because they assumed S3 was always available (Fallacy 1: network is reliable). Services that had implemented graceful degradation (serving cached content, falling back to local storage) continued operating. This incident is the canonical example of why every dependency must have a failure path.

Cloudflare (Latency Spikes and Route Leaks)

Cloudflare has experienced multiple incidents where BGP route leaks or backbone congestion caused unexpected latency spikes of 100-500ms on paths that normally took 5ms (Fallacy 2: latency is zero). Services with tight timeout budgets failed entirely, while those with adaptive timeouts and regional failover continued serving traffic. Their 2020 backbone outage also demonstrated Fallacy 5 (topology changes) when routing shifts sent traffic to unexpected paths.

Slack (Database Sharding Migration)

Slack's migration from a single MySQL instance to a sharded architecture exposed Fallacies 5 and 6. The topology changed fundamentally (queries now had to route to the correct shard), and different teams owned different shards with varying maintenance schedules. The migration took over two years because the original system was designed assuming a single database (stable topology, one admin) and retrofitting shard-awareness into existing code was far more complex than building it from scratch.

Trade-Offs
AspectDescription
Resilience vs SimplicityDesigning for the fallacies adds complexity: retry logic, circuit breakers, timeouts, service discovery, encryption. A simple direct HTTP call becomes wrapped in 50 lines of resilience code. The trade-off is between simplicity (which works until it doesn't) and resilience (which adds development cost but prevents cascading failures).
Latency vs CorrectnessDefensive patterns like retries and synchronous replication (addressing Fallacy 1 and Fallacy 2) add latency. A retry with exponential backoff can turn a 50ms call into a 3-second wait. Balancing the number of retries, timeout durations, and fallback strategies requires careful tuning against latency budgets.
Security vs PerformanceAddressing Fallacy 4 (network is secure) means encrypting all inter-service communication with mTLS, adding authentication headers, and running authorization checks. This adds 1-5ms per call for TLS handshakes and token validation. Service meshes like Istio automate this but add proxy overhead to every request.
Observability vs CostDetecting fallacy-related failures requires distributed tracing, per-hop latency monitoring, bandwidth metrics, and topology-aware alerting. This observability infrastructure has significant cost -- in tooling, storage, and engineering time -- but is essential for diagnosing issues that span multiple network hops and administrative domains.
Case Study

Netflix's Chaos Engineering Approach to the Fallacies

Scenario

Netflix runs thousands of microservices across multiple AWS regions. Despite using managed cloud services, they experienced repeated incidents caused by the fallacies: services timing out when a downstream dependency was slow (Fallacy 2), cascading failures when a service became unreachable (Fallacy 1), and unexpected behavior during region failovers (Fallacy 5). These failures often occurred in combinations that were difficult to anticipate during design reviews.

Solution

Netflix built Chaos Engineering into their development culture, starting with Chaos Monkey (randomly terminates instances) and expanding to Chaos Kong (simulates entire region failures). They enforced resilience patterns through their OSS libraries: Hystrix for circuit breaking, Ribbon for client-side load balancing with retries, and Eureka for service discovery. Every service was required to define a fallback behavior for each dependency -- what to return when the dependency is unreachable, slow, or returning errors.

Outcome

Netflix can survive the loss of an entire AWS region with minimal user impact. Their 2015 us-east-1 evacuation completed in under 10 minutes with no customer-facing errors. The chaos engineering approach forced every team to internalize the fallacies during development rather than discovering them during production incidents. The mean time to recovery for dependency failures dropped from hours to seconds because fallback behaviors were pre-tested and proven.

Common Mistakes
  • Treating network calls like local function calls. A local method call takes nanoseconds, never fails due to network issues, and has zero serialization cost. Network calls are 1,000,000x slower, can fail in dozens of ways, and require serialization. Never design a distributed system as if it were a monolith with longer wires.
  • Hard-coding IP addresses and hostnames. Cloud instances are ephemeral -- IPs change on every restart, and hostnames may resolve to different machines after a failover. Always use service discovery (DNS, Consul, Kubernetes Services) and design for endpoint changes.
  • Assuming retries are free. Naive retry logic (retry immediately on failure) can amplify failures: if a downstream service is overloaded, retries from all callers quadruple the load. Use exponential backoff with jitter, circuit breakers, and retry budgets.
  • Ignoring cross-region data transfer costs. Cloud providers charge $0.01-0.09 per GB for inter-region transfer. A chatty protocol that transfers 10MB per request at 10,000 requests per minute can cost $50,000+ per month in transfer fees alone. Monitor and minimize cross-region payload sizes.
Related Concepts

See Fallacies of Distributed Computing in action

Explore system design templates that use fallacies of distributed computing and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Watch network fallacies cause real failures in a chat system

Metrics to watch
message_delivery_ratep99_latency_mserror_rate_pctreconnection_count
Run Simulation
Test Your Understanding

1Which fallacy is violated when a developer writes a for-loop that makes 100 sequential API calls to a remote service?

2A service fails because a downstream dependency changed its IP address after a cloud instance restart. Which fallacy does this illustrate?

3Why are the fallacies of distributed computing still relevant in 2026 despite cloud managed services?

Deeper Reading