Cloud Computing
DNS and Route 53
In July 2019, a faulty WAF regex rule caused a global Cloudflare outage: for roughly 30 minutes, much of the traffic passing through its network failed. One misconfiguration at the edge can take down everything behind it. Route 53's 100% availability SLA and built-in failover are designed so that DNS is not a single point of failure. Understanding routing policies and health checks is often the difference between 99.9% and 99.99% uptime.
- **Amazon.com** uses Route 53 Latency-based routing to automatically direct users to the lowest-latency region - without it, every request from Tokyo would hit us-east-1, adding roughly 150 ms of latency
- **Shopify** uses Geolocation routing to comply with regional laws: European stores are served from EU regions, simplifying GDPR compliance for merchant data
- **Netflix** uses Weighted routing to gradually shift traffic during deployments: starting with 1% of traffic to the new region and increasing to 100% with zero downtime
Hosted Zones and DNS Records
DNS is the internet's phone book, and Route 53 backs its implementation with a 100% availability SLA, a promise AWS makes for very few of its services. A **Hosted Zone** in Route 53 is a container for the DNS records of a domain. A public hosted zone answers queries from the internet; a private one answers only within specified VPCs. The key differentiator from standard DNS: **Alias records** point directly to an AWS resource (ALB, CloudFront, S3) without specifying an IP address - Route 53 resolves the alias to the resource's current IP addresses at query time, so nothing needs updating when they change.
Alias vs CNAME: a CNAME cannot be created at the zone apex (example.com); an Alias can. Alias queries to AWS resources are free (unlike standard A/CNAME queries). Supported record types include A, AAAA, CNAME, MX, NS, PTR, SOA, SRV, TXT, CAA, DS, and NAPTR. TTL controls how long resolvers cache an answer, and therefore how quickly a change reaches clients - a lower TTL means faster failover.
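The apex rule can be sketched in a few lines of Python. The dictionaries below follow the record-set shape used by the Route 53 `ChangeResourceRecordSets` API, but the helper functions, the ALB DNS name, and the zone ID are hypothetical placeholders for illustration; nothing here calls AWS.

```python
def is_zone_apex(record_name: str, zone_name: str) -> bool:
    """True when the record name is the zone's own name (the apex)."""
    return record_name.rstrip(".").lower() == zone_name.rstrip(".").lower()


def cname_record(record_name: str, zone_name: str, target: str, ttl: int = 300) -> dict:
    """Build a CNAME record set; refuse the apex, where CNAME is not allowed."""
    if is_zone_apex(record_name, zone_name):
        raise ValueError("CNAME cannot be created at the zone apex; use an Alias record")
    return {
        "Name": record_name,
        "Type": "CNAME",
        "TTL": ttl,
        "ResourceRecords": [{"Value": target}],
    }


def alias_record(record_name: str, target_dns: str, target_zone_id: str) -> dict:
    """Build an Alias A record set: no TTL and no IP address to maintain."""
    return {
        "Name": record_name,
        "Type": "A",
        "AliasTarget": {
            "HostedZoneId": target_zone_id,  # hosted zone ID of the target resource
            "DNSName": target_dns,
            "EvaluateTargetHealth": True,
        },
    }


# Hypothetical ALB DNS name and zone ID, for illustration only.
apex = alias_record(
    "example.com.",
    "my-alb-123.us-east-1.elb.amazonaws.com.",
    "ZEXAMPLEALBZONE",
)
```

Calling `cname_record("example.com.", "example.com.", ...)` raises `ValueError`, while the same call for `www.example.com.` succeeds, which mirrors the restriction described above.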
Why is an Alias record preferred over CNAME for pointing to an ALB at the zone apex (example.com)?
Routing Policies
Route 53 does more than store IP addresses - it makes decisions. Its routing policies turn DNS into an intelligent traffic controller. **Weighted routing** distributes requests in proportion to record weights: weights of 80 and 20 send roughly 80% and 20% of queries to each record - ideal for canary deployments without touching application code. **Geolocation routing** sends European users to EU servers and American users to US servers. This supports GDPR-driven data residency requirements by steering EU users to EU regions, but it is a routing aid rather than an enforcement mechanism: location is inferred from the IP address of the DNS resolver or client subnet, so the application still has to control where data is stored.
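To see why weights translate into traffic shares only approximately, here is a small simulation of proportional selection. It is a sketch of the idea, not Route 53's implementation; the record names `stable` and `canary` and the function names are made up for the example.

```python
import random
from collections import Counter
from typing import Dict


def pick_weighted(records: Dict[str, int], rng: random.Random) -> str:
    """Pick one record with probability weight / sum(weights)."""
    names = list(records)
    weights = [records[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def simulate(records: Dict[str, int], queries: int = 10_000, seed: int = 0) -> Counter:
    """Count how many simulated DNS queries each record answers."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return Counter(pick_weighted(records, rng) for _ in range(queries))


shares = simulate({"stable": 80, "canary": 20})
for name, count in shares.items():
    print(f"{name}: {count / 10_000:.1%}")
```

The printed shares land close to 80% and 20% but not exactly on them, and real traffic drifts further because resolvers cache answers for the TTL. A canary rollout amounts to re-running the record update with a larger `canary` weight at each step.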
Routing policies: Simple (single resource), Weighted (proportional split), Latency-based (lowest latency to a region), Failover (active/passive), Geolocation (by country/continent), Geoproximity (by distance with bias), Multivalue Answer (up to 8 healthy records per response). Latency-based routing is a standard record policy and does not require Traffic Flow. Geoproximity was originally offered only through Traffic Flow, which is billed at USD 50 per policy record per month.
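Geolocation routing matches the most specific location first: a country record wins over a continent record, and a default record catches everything else. The sketch below models that precedence with a plain dictionary; the `country:` / `continent:` key format and the target names are assumptions made for this example, not Route 53 syntax.

```python
from typing import Dict, Optional


def resolve_geolocation(rules: Dict[str, str], country: str, continent: str) -> Optional[str]:
    """Return the target for the most specific matching rule."""
    for key in (f"country:{country}", f"continent:{continent}", "default"):
        if key in rules:
            return rules[key]
    return None  # no match and no default record: the query gets no answer


# Hypothetical rule set: EU users go to eu-west-1, everyone else to us-east-1.
rules = {
    "continent:EU": "alb.eu-west-1.example.com",
    "country:US": "alb.us-east-1.example.com",
    "default": "alb.us-east-1.example.com",
}

print(resolve_geolocation(rules, country="DE", continent="EU"))
print(resolve_geolocation(rules, country="JP", continent="AS"))
```

Leaving out the `default` rule is a common mistake: users whose location matches no record, or whose location cannot be determined, would then receive no answer at all.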
An application must comply with GDPR: EU users must land on servers in eu-west-1. Which Route 53 routing policy is correct?
Health Checks
DNS without health checks is like a navigation app that routes to a closed road. Route 53 health checks are a distributed monitoring system: more than 15 AWS checker nodes worldwide send HTTP/HTTPS/TCP requests to a resource, each at an interval of 10 or 30 seconds. Route 53 considers a resource healthy as long as more than 18% of checker nodes report it healthy; at 18% or fewer, it is marked unhealthy. This threshold prevents false positives caused by localized network problems at a few nodes.
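The aggregation rule, as AWS documents it, is that an endpoint counts as healthy while more than 18% of checkers report it healthy. A minimal sketch, assuming each checker's latest result is available as a boolean (the function and constant names are hypothetical):

```python
from typing import List

HEALTHY_SHARE_THRESHOLD = 0.18  # more than 18% of checkers must report healthy


def endpoint_is_healthy(checker_reports: List[bool]) -> bool:
    """Aggregate per-checker results (True = healthy) into one status."""
    if not checker_reports:
        raise ValueError("need at least one checker report")
    healthy_share = sum(checker_reports) / len(checker_reports)
    return healthy_share > HEALTHY_SHARE_THRESHOLD


# 15 checkers, 3 report healthy: 20% > 18%, so the endpoint stays healthy.
print(endpoint_is_healthy([True] * 3 + [False] * 12))
# 15 checkers, 2 report healthy: about 13%, so the endpoint is unhealthy.
print(endpoint_is_healthy([True] * 2 + [False] * 13))
```

The low threshold means a handful of checkers with a working path to the endpoint is enough to keep it in service, so a regional network problem between some checkers and the endpoint does not trigger failover.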
Three types of health checks: Endpoint (HTTP/HTTPS/TCP to an IP or domain), Calculated (combines multiple health checks via AND/OR/NOT), CloudWatch Alarm (tied to a metric alarm state). Endpoint checks can verify that a string appears in the first 5,120 bytes of the response body. Pricing: USD 0.50/month per health check for AWS endpoints and USD 0.75/month for non-AWS endpoints; optional features such as string matching, HTTPS, and the fast 10-second interval cost extra.
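A calculated health check can be thought of as "at least N of these child checks are healthy", optionally inverted. The sketch below shows how AND, OR, and NOT fall out of that single rule; the function name and parameters are illustrative, not the API's field names.

```python
from typing import List


def calculated_status(children: List[bool], healthy_threshold: int, inverted: bool = False) -> bool:
    """Healthy when at least `healthy_threshold` children are healthy."""
    healthy = sum(children) >= healthy_threshold
    return not healthy if inverted else healthy


children = [True, True, False]  # statuses of three child health checks

and_result = calculated_status(children, healthy_threshold=len(children))  # AND: all must pass
or_result = calculated_status(children, healthy_threshold=1)               # OR: any one passes
not_result = calculated_status([False], healthy_threshold=1, inverted=True)  # NOT: invert one child

print(and_result, or_result, not_result)
```

Thresholds between 1 and the number of children give "M of N" rules, for example treating a service as healthy while at least two of three backends respond.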
At what threshold does Route 53 mark a resource as unhealthy based on checker node reports?
DNS Failover
Route 53 Failover routing is an automatic switch that activates without human intervention. The active/passive scheme: the primary resource serves traffic; the secondary sits idle until a failure. When the primary health check reports a problem, Route 53 starts returning the secondary's IP - but TTL slows the switchover. Rule of thumb: set a low TTL (60 seconds or less) on active records in production with strict SLAs.
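An active/passive pair can be sketched as two record sets that share a name and differ in their failover role. The field names below follow the Route 53 record-set shape; the helper functions, the health check ID, and the IP addresses (taken from documentation-reserved ranges) are placeholders, and the `answer` function is a simplified model of the selection logic rather than the service's actual behavior.

```python
from typing import Optional


def failover_record(name: str, ip: str, role: str,
                    health_check_id: Optional[str] = None, ttl: int = 60) -> dict:
    """Build one half of an active/passive failover pair."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-{ip}",  # must be unique within the pair
        "Failover": role,
        "TTL": ttl,  # low TTL so clients drop the old answer quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return record


def answer(primary: dict, secondary: dict, primary_healthy: bool) -> str:
    """Simplified model: serve the primary while it is healthy, else the secondary."""
    chosen = primary if primary_healthy else secondary
    return chosen["ResourceRecords"][0]["Value"]


primary = failover_record("app.example.com.", "192.0.2.10", "PRIMARY",
                          health_check_id="hc-placeholder-id")
secondary = failover_record("app.example.com.", "198.51.100.10", "SECONDARY")

print(answer(primary, secondary, primary_healthy=True))
print(answer(primary, secondary, primary_healthy=False))
```

Only the primary needs a health check for the switch to happen; attaching one to the secondary as well is a design choice that tells Route 53 whether the standby is actually able to take traffic.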
Active/active failover is achieved via Weighted or Multivalue routing with health checks: all endpoints are active, and unhealthy ones are automatically excluded from responses. This differs from Failover routing (active/passive). Total switchover time = health check detection time (30-90 sec) + TTL. With TTL=60 sec, that works out to roughly 1.5 to 2.5 minutes for a client that cached the old answer just before the failure was detected.
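The arithmetic can be made explicit. The sketch below assumes a failure threshold of 3 consecutive failed checks, which reproduces the 30-90 second detection range quoted above for request intervals of 10 and 30 seconds; the function names are illustrative.

```python
def detection_seconds(request_interval: int, failure_threshold: int = 3) -> int:
    """Time for a checker to see enough consecutive failures."""
    return request_interval * failure_threshold


def worst_case_switchover(request_interval: int, ttl: int, failure_threshold: int = 3) -> int:
    """Detection time plus a full TTL for a client that just cached the old IP."""
    return detection_seconds(request_interval, failure_threshold) + ttl


fast = worst_case_switchover(request_interval=10, ttl=60)      # 30 + 60
standard = worst_case_switchover(request_interval=30, ttl=60)  # 90 + 60

print(f"fast interval: {fast} s, standard interval: {standard} s")
```

With these assumptions the result is 90 seconds for the fast interval and 150 seconds for the standard one. Lowering the TTL from 60 to 30 seconds saves only 30 seconds, whereas switching from the standard to the fast interval saves 60, so the health check interval is often the bigger lever.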
Misconception: DNS Failover provides instant traffic switching when a failure occurs.
Reality: Switchover takes at minimum 30-90 seconds (health check detection), plus up to the TTL duration during which clients cache the old IP.
DNS is cached at the client, ISP resolver, and OS levels. Even after Route 53 updates the record, clients continue using the old IP until TTL expires. The solution is a low TTL in production environments with strict SLA requirements.
Which DNS record parameter is most critical for minimizing switchover time in Failover routing?
Key Ideas
- **Alias records** are a Route 53 extension that works at the zone apex and is free for AWS resources - always prefer Alias over CNAME when pointing to AWS endpoints
- **Routing Policies** turn DNS into a traffic controller: Weighted for canary, Geolocation for GDPR, Latency-based for performance, Failover for disaster recovery
- **TTL is the critical parameter**: low TTL (60 sec) speeds up failover and canary rollouts; high TTL (86400 sec) reduces DNS query load but slows change propagation
Related Topics
Route 53 is typically the entry point for traffic into AWS infrastructure:
- Load Balancing and CDN - ALB and CloudFront are the primary targets of Route 53 Alias records; Failover routing switches between them on failure
- VPC and Network Isolation - Private hosted zones answer queries only from their associated VPCs, providing DNS resolution for internal services without exposure to the public internet
Questions for Reflection
- An application is deployed in two AWS regions. How would you configure Route 53 for automatic failover when the primary region goes down, and what TTL would you set?
- Weighted routing is used for a canary deployment. How would you gradually increase the new version's weight - and how would you roll back within 60 seconds if problems appear?
- Geolocation and Latency-based routing can return different results for the same user. Which policy takes priority for a global SaaS product, and what are the tradeoffs?