
DNS (Recursive, Authoritative, Anycast)

The Domain Name System translates human-readable domain names into IP addresses through a hierarchical resolution process involving recursive resolvers, root servers, TLD servers, and authoritative nameservers. DNS is also used for load balancing, failover, and traffic routing.

Overview

The Domain Name System (DNS) is one of the most critical pieces of internet infrastructure, translating human-readable domain names (like google.com) into IP addresses (like 142.250.80.46) that machines use to route packets. DNS resolution is the very first step in nearly every internet interaction, and its performance and reliability directly impact every system that depends on the internet. Understanding DNS deeply -- its hierarchical architecture, caching behavior, record types, and use as a traffic management tool -- is essential for system design because DNS failures are among the most impactful outages (as the 2016 Dyn attack demonstrated).

DNS operates as a distributed, hierarchical database. At the top are 13 logical root server clusters (named a.root-servers.net through m.root-servers.net), each operated by different organizations and replicated globally via Anycast. Root servers do not know every domain -- they delegate to TLD (Top-Level Domain) servers for .com, .org, .net, etc. TLD servers delegate to authoritative nameservers for specific domains (e.g., ns1.google.com for google.com). When a client needs to resolve a domain, it sends a query to a recursive resolver (typically provided by the ISP, or a public resolver like Cloudflare 1.1.1.1 or Google 8.8.8.8). The recursive resolver walks the hierarchy: query root for .com, query .com TLD for google.com, query google.com's authoritative nameserver for the final IP address. Each response includes a TTL (Time To Live) that determines how long the result can be cached, avoiding repeated hierarchy traversals.
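The walk described above can be sketched with a toy, in-memory hierarchy. This is only an illustration under stated assumptions: the server labels (`tld-com`, `ns-example`), the `resolve` function, and the address are hypothetical, no network queries are made, and a real resolver also handles caching, retries, and deeper delegations.

```python
# Toy delegation data: every name and address here is made up for illustration.
ROOT = {"com.": "tld-com"}                          # root knows who serves each TLD
TLD = {"tld-com": {"example.com.": "ns-example"}}   # TLD knows each domain's nameserver
AUTH = {"ns-example": {"www.example.com.": ("192.0.2.10", 300)}}  # (address, TTL)


def resolve(name: str) -> tuple[str, int, list[str]]:
    """Walk root -> TLD -> authoritative and return (address, ttl, steps)."""
    labels = name.rstrip(".").split(".")
    tld = labels[-1] + "."
    domain = ".".join(labels[-2:]) + "."
    steps = []

    tld_server = ROOT[tld]                   # 1. root refers us to the TLD server
    steps.append(f"root -> {tld_server}")
    auth_server = TLD[tld_server][domain]    # 2. TLD refers us to the authoritative server
    steps.append(f"{tld_server} -> {auth_server}")
    address, ttl = AUTH[auth_server][name]   # 3. authoritative server answers
    steps.append(f"{auth_server} -> {address}")
    return address, ttl, steps


address, ttl, steps = resolve("www.example.com.")
print(address, ttl)
for step in steps:
    print(step)
```

Each entry in `steps` corresponds to one referral or answer, which is why a cold lookup costs several round trips while a cached one costs none.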

DNS supports multiple record types that serve different purposes. A records map a domain to an IPv4 address, AAAA records to IPv6. CNAME records create aliases (www.example.com as an alias for example.com). MX records specify mail servers. TXT records hold arbitrary text, commonly used for email authentication policies (SPF, DKIM, DMARC) and domain-ownership challenges (such as the ACME DNS-01 challenge used by Let's Encrypt). NS records name the authoritative nameservers for a zone and delegate subdomains to other nameservers. SRV records provide service discovery with host, port, priority, and weight -- used by protocols like LDAP and SIP. SOA (Start of Authority) records define zone metadata including the primary nameserver, the zone serial number, refresh and retry timers, and the negative-caching TTL.
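To make the record types concrete, here is a minimal sketch of a zone held as a list of records, with a lookup that follows CNAME aliases. The `Record` class, the `lookup` helper, and the zone contents are assumptions made for illustration; they do not mirror any particular DNS library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str
    rtype: str   # "A", "AAAA", "CNAME", "MX", ...
    value: str
    ttl: int


# Hypothetical zone contents using documentation-reserved names and addresses.
ZONE = [
    Record("example.com.", "A", "192.0.2.10", 300),
    Record("example.com.", "AAAA", "2001:db8::10", 300),
    Record("www.example.com.", "CNAME", "example.com.", 3600),
    Record("example.com.", "MX", "10 mail.example.com.", 3600),
]


def lookup(name: str, rtype: str, max_hops: int = 8) -> list[Record]:
    """Return records of rtype for name, following CNAME aliases."""
    for _ in range(max_hops):
        exact = [r for r in ZONE if r.name == name and r.rtype == rtype]
        if exact:
            return exact
        alias = [r for r in ZONE if r.name == name and r.rtype == "CNAME"]
        if not alias:
            return []
        name = alias[0].value  # restart the lookup at the alias target
    raise RuntimeError("CNAME chain too long or looping")


answers = lookup("www.example.com.", "A")
print([r.value for r in answers])
```

The hop limit is a simple guard against alias loops; real resolvers apply similar limits.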

Beyond simple name resolution, DNS serves as a powerful traffic management and load balancing tool. Round-robin A records return multiple IP addresses in rotating order, distributing traffic across servers. Weighted routing (supported by AWS Route 53 and Cloudflare) allows directing a percentage of traffic to specific endpoints. Geo-DNS returns different IP addresses based on the client's geographic location (inferred from the resolver's address or the EDNS Client Subnet option), routing European users to European servers and US users to US servers. Latency-based routing (Route 53) uses latency measurements between client networks and each endpoint's region and returns the lowest-latency option. Health-checked failover removes unhealthy endpoints from DNS responses within seconds of detection, although resolvers and clients that already cached the old answer keep using it until its TTL expires. Anycast -- announcing the same IP address from multiple global locations via BGP -- routes each client to the topologically nearest instance, which is usually but not always the geographically closest one. This combination of features makes DNS a first-hop load balancer and failover mechanism that operates before the HTTP request is even sent.
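Round-robin and weighted answers can be sketched in a few lines. This is an illustrative model only: the `round_robin` and `weighted_pick` helpers are hypothetical names, the addresses are documentation-reserved, and managed DNS providers implement these policies in their own ways.

```python
import random
from collections import deque


def round_robin(addresses: list[str]):
    """Yield the address list rotated by one position per query."""
    ring = deque(addresses)
    while True:
        yield list(ring)
        ring.rotate(-1)


def weighted_pick(weights: dict[str, int], rng: random.Random) -> str:
    """Pick one endpoint with probability proportional to its weight."""
    endpoints = list(weights)
    return rng.choices(endpoints, weights=[weights[e] for e in endpoints], k=1)[0]


rr = round_robin(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
first, second = next(rr), next(rr)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
weights = {"us-east-1": 90, "eu-west-1": 10}
picks = [weighted_pick(weights, rng) for _ in range(10_000)]
share = picks.count("us-east-1") / len(picks)
print(first, second, round(share, 2))
```

Because answers are cached downstream, the observed traffic split only approximates the configured weights.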

Key Points
  1. DNS hierarchy: root servers (13 logical clusters via Anycast) delegate to TLD servers (.com, .org), which delegate to authoritative nameservers (ns1.google.com) that hold the actual A/AAAA/CNAME records for domains.
  2. Recursive resolvers (1.1.1.1, 8.8.8.8, ISP resolvers) walk the hierarchy on behalf of clients and cache results for the duration of the TTL. A cold resolution requires 3-4 round trips; a cached resolution is instant from the resolver's perspective.
  3. TTL (Time To Live) controls how long DNS responses are cached at each layer. Short TTLs (30-300s) enable fast failover but increase query volume to authoritative servers. Long TTLs (3600s+) reduce query load but make changes propagate slowly.
  4. Anycast allows the same IP address to be announced from multiple geographic locations via BGP. The network automatically routes each client to the nearest Anycast instance. This is how Cloudflare's 1.1.1.1 resolver is served from 310+ cities worldwide.
  5. DNS-based load balancing includes round-robin (multiple A records), weighted routing (percentage-based), geo-DNS (location-based), and latency-based routing (lowest RTT). Health checks can remove failed endpoints from DNS responses within seconds.
  6. DNS is a single point of failure for all services behind a domain. The 2016 Dyn DDoS attack (Mirai botnet) took down DNS resolution for major sites including Twitter, Netflix, and Reddit, demonstrating that DNS infrastructure must be treated as critical infrastructure with redundancy across multiple providers.
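The caching behavior in the key points can be illustrated with a small TTL cache. This is a sketch, not how any specific resolver is built: the `TTLCache` class is hypothetical, and the clock is injected so that expiry can be shown without waiting (in real use one might pass `time.monotonic`).

```python
class TTLCache:
    """Cache answers until their TTL expires; the clock is injected for testing."""

    def __init__(self, clock):
        self._clock = clock
        self._entries = {}  # name -> (address, expiry time)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: force a fresh resolution
            return None
        return address

    def put(self, name, address, ttl):
        self._entries[name] = (address, self._clock() + ttl)


now = [0.0]                      # fake clock we can advance by hand
cache = TTLCache(lambda: now[0])
cache.put("www.example.com.", "192.0.2.10", ttl=60)

now[0] = 59
hit = cache.get("www.example.com.")    # still inside the TTL
now[0] = 60
miss = cache.get("www.example.com.")   # TTL elapsed, entry dropped
print(hit, miss)
```

The same mechanism explains slow propagation: until the entry expires, the cache keeps serving the old address even if the authoritative record has changed.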
Simple Example

The Phone Book Analogy

DNS works like a hierarchical phone book system. When you want to call 'John Smith at Acme Corp in New York,' you first call the global directory (root server) which says 'for US companies, call this number' (TLD server). The US directory says 'for Acme Corp, call this number' (authoritative nameserver). Acme Corp's receptionist gives you John's direct number (IP address). You write the number down (cache it) and it stays valid for a day (TTL). Next time you call John, you use the number from your notes instead of going through the whole directory chain. If Acme Corp moves offices, your cached number is wrong until the TTL expires.

Real-World Examples

Cloudflare

Cloudflare operates the 1.1.1.1 public DNS resolver from 310+ cities worldwide using Anycast. Every user's DNS query is routed to the nearest Cloudflare PoP by BGP routing, achieving a median resolution time under 11ms globally. Cloudflare also provides authoritative DNS for millions of domains, with a 100% uptime SLA backed by Anycast redundancy. Their DNS infrastructure handles over 1 trillion DNS queries per day.

AWS Route 53

Route 53 is AWS's authoritative DNS service offering advanced traffic routing: weighted routing (send 90% of traffic to us-east-1, 10% to eu-west-1), latency-based routing (measure and route to the lowest-latency region), geo-location routing (EU users get EU endpoints), and health-checked failover (automatically remove unhealthy endpoints from DNS responses within 30 seconds). Route 53 uses a global Anycast network with a 100% availability SLA.

Dyn (2016 Attack)

In October 2016, the Mirai botnet launched a massive DDoS attack against Dyn, a major DNS provider. The attack reportedly generated more than 1 Tbps of traffic against Dyn's DNS infrastructure, causing DNS resolution failures for major services including Twitter, Netflix, Reddit, GitHub, and Spotify. The attack demonstrated that DNS is a critical single point of failure: even though the target services' own infrastructure was unaffected, users could not reach them because domain names could not be resolved to IP addresses.

Trade-Offs
Aspect | Description
TTL Length: Fast Failover vs Query Volume | Short TTLs (30-60 seconds) allow rapid failover by ensuring clients re-resolve frequently, picking up new IP addresses quickly when backends change. But short TTLs increase query volume to authoritative nameservers by 60-120x compared to 1-hour TTLs, increasing cost and load. Long TTLs reduce query volume but mean DNS changes take up to TTL seconds to propagate.
DNS Load Balancing vs Application Load Balancing | DNS-based load balancing is simple and operates before the HTTP connection, but it is coarse-grained: DNS responses are cached, so traffic distribution is approximate rather than per-request. Application-layer load balancers (L7) provide precise per-request routing, health checking, and content-based routing, but require infrastructure in the request path.
Anycast Simplicity vs Routing Unpredictability | Anycast routes clients to the nearest instance automatically via BGP, requiring no client-side logic. However, BGP routing changes can cause clients to temporarily shift between Anycast instances, potentially dropping stateful connections. Anycast works best for stateless protocols like DNS and CDN edge serving.
Single Provider vs Multi-Provider DNS | Using a single DNS provider is simpler to manage but creates a single point of failure (as the Dyn attack showed). Multi-provider DNS (e.g., Route 53 + Cloudflare) provides redundancy but adds operational complexity: records must be synchronized across providers, and provider-specific features (weighted routing, geo-DNS) may differ.
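The 60-120x figure in the TTL row follows from simple arithmetic, sketched below. The `queries_per_hour` helper is hypothetical, and it assumes every caching resolver re-queries the moment its copy expires, so the result is an upper bound that only holds for names requested continuously.

```python
def queries_per_hour(ttl_seconds: int, resolvers: int = 1) -> float:
    """Upper-bound authoritative queries per hour for one record.

    Assumes every resolver re-queries the moment its cached copy expires,
    which only holds for names that are requested continuously.
    """
    return resolvers * 3600 / ttl_seconds


baseline = queries_per_hour(3600)
for ttl in (30, 60, 300):
    ratio = queries_per_hour(ttl) / baseline
    print(f"TTL {ttl:>4}s: {ratio:.0f}x the volume of a 1-hour TTL")
```

A 30-second TTL gives 3600 / 30 = 120 refreshes per hour per resolver and a 60-second TTL gives 60, against a single refresh for a 1-hour TTL.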
Case Study

The 2016 Dyn DDoS Attack -- DNS as Critical Infrastructure

Scenario

In October 2016, the Mirai botnet -- composed of hundreds of thousands of compromised IoT devices (cameras, DVRs, routers) -- launched a distributed denial-of-service attack against Dyn, one of the major managed DNS providers. Dyn provided authoritative DNS for many high-profile services. When Dyn's DNS infrastructure became unreachable under the attack, DNS queries for these domains could not be resolved, effectively making the services unreachable even though their own servers were operating normally.

Solution

Dyn worked with upstream network providers and law enforcement to mitigate the attack through traffic filtering and Anycast-based traffic absorption. In the aftermath, affected companies diversified their DNS infrastructure: many added secondary DNS providers (Cloudflare, Route 53, Google Cloud DNS) so that if one provider is attacked, the other continues resolving queries. NS records were updated to include nameservers from multiple providers, and automated synchronization tools ensured record consistency across providers.
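Keeping records consistent across providers comes down to comparing record sets. The sketch below shows one possible drift check; the `diff_records` helper and both record sets are hypothetical, and a real tool would fetch the records through each provider's API or a zone transfer.

```python
def diff_records(primary: set[tuple], secondary: set[tuple]) -> dict[str, set[tuple]]:
    """Compare two providers' record sets, each a set of (name, type, value)."""
    return {
        "missing_from_secondary": primary - secondary,
        "missing_from_primary": secondary - primary,
    }


# Hypothetical exports from two providers; real ones would come from each API.
provider_a = {
    ("example.com.", "A", "192.0.2.10"),
    ("www.example.com.", "CNAME", "example.com."),
}
provider_b = {
    ("example.com.", "A", "192.0.2.10"),
}

drift = diff_records(provider_a, provider_b)
in_sync = not any(drift.values())
print(in_sync, drift["missing_from_secondary"])
```

Running a check like this on a schedule is one way to catch a record that was added at one provider but never copied to the other.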

Outcome

The attack lasted approximately 11 hours and affected tens of millions of users. Major services including Twitter, Netflix, Reddit, CNN, and The New York Times experienced intermittent outages. The incident was a watershed moment for DNS resilience: it demonstrated that a managed DNS provider can be a single point of failure for every service that depends on it. Multi-provider DNS became a best practice, and companies like Cloudflare invested heavily in DDoS-resistant DNS infrastructure with Anycast networks capable of absorbing terabits per second of attack traffic.

Common Mistakes
  • Setting extremely long TTLs (24+ hours) to reduce DNS costs, then being unable to fail over quickly when a data center goes down. Critical services should use TTLs of 60-300 seconds to enable rapid DNS-based failover, accepting the higher query volume.
  • Relying on a single DNS provider without secondary failover. The Dyn attack proved that DNS provider failure takes down all services behind it. Critical systems should use at least two DNS providers with synchronized records.
  • Ignoring DNS propagation delays when making infrastructure changes. After updating DNS records, clients will continue using the old cached values until the TTL expires. Plan DNS changes with propagation time in mind, especially for migrations.
  • Using CNAME records at the zone apex (e.g., example.com CNAME other.example.com). RFC 1034 does not allow a CNAME to coexist with other records at the same name, and the zone apex must always carry SOA and NS records, so a CNAME there is invalid. Use ALIAS/ANAME records (provider-specific) or A/AAAA records for zone apex routing.
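The CNAME rule in the last item can be checked mechanically. The sketch below flags any name that holds a CNAME next to another record type; the `cname_conflicts` helper and the zone data are hypothetical, and the check is simplified (it ignores DNSSEC record types, which are permitted alongside a CNAME).

```python
def cname_conflicts(records: list[tuple[str, str]]) -> list[str]:
    """Return names that hold a CNAME alongside any other record type.

    records is a list of (name, type) pairs.
    """
    types_by_name: dict[str, set[str]] = {}
    for name, rtype in records:
        types_by_name.setdefault(name, set()).add(rtype)
    return sorted(
        name for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


zone = [
    ("example.com.", "SOA"),
    ("example.com.", "NS"),
    ("example.com.", "CNAME"),      # invalid: apex already has SOA and NS
    ("www.example.com.", "CNAME"),  # fine: nothing else at this name
]
print(cname_conflicts(zone))
```

The apex is flagged because it must carry SOA and NS records, while the `www` alias passes because it stands alone.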
Test Your Understanding

1. What is the role of a recursive DNS resolver?

2. Why did the 2016 Dyn DDoS attack cause widespread internet outages even though target websites' servers were functioning normally?

3. How does Anycast improve DNS resolver performance?
