Distributed Systems

Service Discovery

Lesson Objectives

  • Explain why hardcoding IP addresses fails in cloud-native architectures
  • Compare DNS-based discovery and service registries by capabilities and limitations
  • Distinguish between self-registration and third-party registration patterns
  • Choose between client-side and server-side discovery for specific requirements

Prerequisites

  • Understanding of microservice architecture (services as separate processes)
  • Basic knowledge of DNS (A records, TTL)
  • Familiarity with Kubernetes at the Pod and Service level

When Netflix moved to AWS in 2012, it was running 800 microservices, each with dynamic IPs. A single deploy changed the addresses of dozens of instances within seconds. That led to Eureka, the open-source service registry now running in thousands of companies.

  • **Kubernetes** - every one of 5+ million k8s clusters uses CoreDNS + Endpoints for service discovery out of the box
  • **Consul** by HashiCorp - 100,000+ production deployments including Stripe, Cloudflare, Adobe
  • **AWS ELB + Route 53** - server-side discovery for hundreds of thousands of applications in AWS
  • **Istio service mesh** - extends discovery to circuit breaking, canary deployments, and mTLS between services
  • **etcd** - every Kubernetes API server stores cluster state in etcd, including Endpoints used for discovery

Eureka and the birth of cloud-native discovery

Netflix open-sourced Eureka in 2012 after building it internally for the AWS migration. Before Eureka, static configs and proprietary solutions were the norm. Eureka popularized concepts that became an industry standard: self-registration, heartbeats, and client-side caching of the registry. Today Eureka is integrated into Spring Cloud and runs in thousands of Java applications.

The Problem: IP Addresses Are Not Permanent

**Netflix, 2012. Migration to AWS. 800+ microservices, each scaling horizontally. An EC2 instance starts up - it gets a new IP. Restarts - another new IP. Hardcoding addresses in configs became impossible.** Netflix built Eureka - the first major open-source service registry. Today every Kubernetes cluster solves this same problem with built-in tooling.

**The core problem:** in a static architecture, service A knows service B's address in advance. In a dynamic environment, instances of B appear and disappear, scaling from 1 to 100 and back. A mechanism is needed to answer: "Which instances of service X are alive right now and ready to accept traffic?"

Three Reasons for Dynamic Addresses

| Cause | What Happens | Frequency |
|---|---|---|
| Autoscaling | New instances spin up under load, old ones are terminated | Minutes to hours |
| Rolling deploy | New version replaces old one pod by pod; each pod gets a new IP | Every deployment |
| Crash + restart | Kubernetes restarts a crashed pod on another node, again with a new IP | Unpredictable |

Misconception: service discovery is only needed at the scale of hundreds of microservices.

Reality: two services with autoscaling are enough to require a discovery mechanism.

As soon as one service has more than one instance or restarts automatically, its IP becomes unstable. Kubernetes solves this for every application via ClusterIP and CoreDNS.

Why does hardcoding IP addresses fail in modern cloud architectures?

DNS as Service Discovery

**Kubernetes has used DNS for service discovery since its first release in 2014. Every Service gets a DNS name like `payment-service.default.svc.cluster.local`. Pods do not need to know any IP addresses, only names.** DNS is the oldest and most universal discovery mechanism, but its limitations matter in practice.

Limitations of DNS-based Discovery

| Advantage | Limitation | Consequence |
|---|---|---|
| Works everywhere without extra infrastructure | DNS TTL caches records | Client keeps sending requests to a dead IP until the TTL expires |
| Standard protocol, supported by every language | No built-in health checks | DNS does not know whether an instance is unhealthy, only that a record for it exists |
| Kubernetes CoreDNS updates records quickly | No metadata (version, region, weight) | Cannot do canary deploys through DNS without additional tooling |

**DNS TTL trap:** the JVM and many HTTP clients cache DNS responses on their own, sometimes for longer than the record's TTL. Even with TTL=10s an application may hold a stale IP for minutes. In Kubernetes the usual fix is a Service of type ClusterIP: the name resolves to a stable virtual IP, so a cached answer stays valid, while kube-proxy reroutes traffic to live pods via iptables/ipvs.
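The stale-address window can be reproduced in a few lines. The sketch below is a simplified model, not how any particular resolver is implemented: `CachingResolver`, the `records` dictionary, and the injected clock are all hypothetical names introduced for illustration, and a real client would perform an actual DNS query instead of a dictionary lookup.

```python
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Client-side DNS cache that honours a fixed TTL (hypothetical helper)."""

    def __init__(self, lookup: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float]) -> None:
        self._lookup = lookup      # stands in for a real DNS query
        self._ttl = ttl_seconds
        self._clock = clock        # injected so the example needs no sleeping
        self._cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, name: str) -> str:
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]       # still within TTL: the answer may be stale
        ip = self._lookup(name)
        self._cache[name] = (ip, now + self._ttl)
        return ip


# Simulated authoritative records and a manually advanced clock.
records = {"payment-service": "10.0.0.5"}
now = [0.0]
resolver = CachingResolver(records.__getitem__, ttl_seconds=30.0,
                           clock=lambda: now[0])

first = resolver.resolve("payment-service")       # cached at t=0
records["payment-service"] = "10.0.0.9"           # old instance dies, new one appears
now[0] = 10.0
during_ttl = resolver.resolve("payment-service")  # cache still returns the dead IP
now[0] = 31.0
after_ttl = resolver.resolve("payment-service")   # TTL expired, fresh lookup
```

Between t=0 and t=30 the client keeps using the old address even though the record has already changed, which is exactly the failure window described above.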

A client resolved a service IP via DNS. The service crashes. DNS TTL = 30 seconds. What happens?

Service Registry: Consul, etcd, ZooKeeper

**Consul was launched by HashiCorp in 2014. Within two years: 10,000 production deployments.** The reasons for its success: built-in health checks, a watch API for near-instant updates, and a KV store for configs, all in one tool. etcd was chosen by Kubernetes in 2014 as its cluster state store and has been the default state store for Kubernetes clusters ever since.

Self-registration vs Third-party registration

| Pattern | Who Registers | Advantage | Drawback |
|---|---|---|---|
| Self-registration | The service itself on startup | Service controls its own metadata | Every service needs the registry SDK |
| Third-party | External registrar (Kubernetes, Registrator) | Services are unaware of the registry | Registrar is an extra component and a potential SPOF |

**Kubernetes uses third-party registration:** the Endpoints controller watches pods via the API server and automatically updates the Endpoints object when pods change. The service contains zero lines of registration code.
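The idea of a third-party registrar can be sketched as a pure reconcile function: it observes the current pod list and derives the set of addresses that should receive traffic. This is a toy model under stated assumptions, not the actual Endpoints controller; the `Pod` type, its `service` field (standing in for label selectors), and `reconcile_endpoints` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Pod:
    name: str
    service: str   # which service the pod belongs to (stands in for label selectors)
    ip: str
    ready: bool    # outcome of the readiness probe


def reconcile_endpoints(pods: List[Pod]) -> Dict[str, List[str]]:
    """Build the service -> ready IPs mapping from the observed pod list."""
    endpoints: Dict[str, List[str]] = {}
    for pod in pods:
        if pod.ready:
            endpoints.setdefault(pod.service, []).append(pod.ip)
    return {service: sorted(ips) for service, ips in endpoints.items()}


before = reconcile_endpoints([
    Pod("payment-1", "payment-service", "10.0.0.5", ready=True),
    Pod("payment-2", "payment-service", "10.0.0.6", ready=True),
])

# payment-1 crashes and is replaced by payment-3, which has a new IP
# and has not passed its readiness probe yet.
after = reconcile_endpoints([
    Pod("payment-2", "payment-service", "10.0.0.6", ready=True),
    Pod("payment-3", "payment-service", "10.0.0.7", ready=False),
])
```

The pods themselves never call the registry: the mapping changes only because the registrar saw a different pod list.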

Popular Registries

| System | Consensus | Health Checks | Use Case |
|---|---|---|---|
| Consul | Raft | HTTP / TCP / gRPC / script | Service mesh, multi-datacenter |
| etcd | Raft | Via TTL keys or watch | Kubernetes state store, config |
| ZooKeeper | ZAB | Ephemeral nodes (disappear on disconnect) | Legacy: Kafka, Hadoop |
| Kubernetes DNS + Endpoints | Built into k8s | Readiness probe | Out of the box in k8s clusters |
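The TTL-key and ephemeral-node mechanisms in the table share one idea: a registration lives only as long as the instance keeps renewing it. Below is a minimal in-memory sketch of that idea; `LeaseRegistry` and its methods are hypothetical and do not mirror the client API of etcd, ZooKeeper, or Consul.

```python
from typing import Callable, Dict, List


class LeaseRegistry:
    """In-memory registry where each registration expires unless renewed."""

    def __init__(self, lease_seconds: float, clock: Callable[[], float]) -> None:
        self._lease = lease_seconds
        self._clock = clock                        # injected for deterministic runs
        self._expires_at: Dict[str, float] = {}    # instance address -> expiry time

    def heartbeat(self, address: str) -> None:
        """Register the instance or renew its lease."""
        self._expires_at[address] = self._clock() + self._lease

    def alive(self) -> List[str]:
        """Drop expired leases and return the remaining addresses."""
        now = self._clock()
        self._expires_at = {a: t for a, t in self._expires_at.items() if t > now}
        return sorted(self._expires_at)


now = [0.0]
registry = LeaseRegistry(lease_seconds=10.0, clock=lambda: now[0])
registry.heartbeat("10.0.0.5:8080")
registry.heartbeat("10.0.0.6:8080")

now[0] = 8.0
registry.heartbeat("10.0.0.6:8080")   # only this instance keeps renewing
now[0] = 12.0
survivors = registry.alive()          # 10.0.0.5 missed its renewal window
```

A crashed instance needs no explicit deregistration here: it simply stops renewing and drops out once its lease runs out.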

Misconception: heartbeat equals health check; if a process sends a heartbeat, the service is healthy.

Reality: a heartbeat checks process liveness; a health check verifies service functionality.

A process can be alive and sending heartbeats yet be stuck in a deadlock, have exhausted its connection pool, or have lost its database connection. The /health endpoint should verify dependencies (DB, cache, queues) and return 503 when any of them is degraded.
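A dependency-aware health check can be sketched as a function that runs one probe per dependency and derives the HTTP status from the results. The function name `health_status` and the probe callables are assumptions for illustration; a real service would wire this into its web framework's /health route and use real connectivity checks.

```python
from typing import Callable, Dict, Tuple


def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and derive the HTTP status for /health."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "degraded"
        except Exception:
            # A check that raises (timeout, refused connection) counts as a failure.
            results[name] = "degraded"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results


def broken_db() -> bool:
    """Hypothetical probe simulating a lost database connection."""
    raise ConnectionError("database unreachable")


healthy = health_status({"db": lambda: True, "cache": lambda: True})
unhealthy = health_status({"db": broken_db, "cache": lambda: True})
```

In the second call the process is still running and could keep sending heartbeats, but the check reports 503 because one dependency is unavailable.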

A service instance is frozen (process is alive but not responding to requests). Which registry mechanism will detect this?

Client-Side vs Server-Side Discovery

**Netflix OSS (2012-2015) standardized client-side discovery with Ribbon + Eureka. AWS ELB and Kubernetes implement server-side. Both patterns are in production today** - the choice depends on how much complexity is acceptable in the client and whether an extra network hop matters.

Client-Side Discovery

The client queries the registry, receives the list of healthy instances, and picks one using a load-balancing algorithm.
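As a rough illustration of the client-side flow, the sketch below fetches the instance list on every call and rotates through it round-robin. `RoundRobinClient` and the dictionary-backed registry are hypothetical; this is not the Ribbon or Eureka API, and production clients typically cache the list and refresh it in the background.

```python
import itertools
from typing import Callable, Iterator, List


class RoundRobinClient:
    """Queries a registry for healthy instances and rotates through them."""

    def __init__(self, fetch_instances: Callable[[str], List[str]]) -> None:
        self._fetch = fetch_instances
        self._counter: Iterator[int] = itertools.count()

    def pick(self, service: str) -> str:
        instances = self._fetch(service)   # fresh list on every call, no caching
        if not instances:
            raise LookupError(f"no healthy instances for {service}")
        return instances[next(self._counter) % len(instances)]


registry = {"payment-service": ["10.0.0.5:8080", "10.0.0.6:8080"]}
client = RoundRobinClient(lambda name: registry.get(name, []))

picks = [client.pick("payment-service") for _ in range(3)]
registry["payment-service"].remove("10.0.0.5:8080")   # instance deregisters
after_removal = client.pick("payment-service")
```

Because the selection logic lives in the client, swapping round-robin for weighted or zone-aware selection means changing and redeploying every client, which is the trade-off this pattern carries.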

Server-Side Discovery

The client sends requests to a fixed load balancer address. The LB consults the registry and picks an instance on its own.

| Approach | Advantages | Drawbacks | Example |
|---|---|---|---|
| Client-Side | Flexible balancing, no SPOF | Every client must be updated when balancing logic changes | Netflix Ribbon + Eureka |
| Server-Side | Client is simple: one fixed address | Extra network hop, LB is a potential SPOF | AWS ELB, Kubernetes Service |

In **Kubernetes** server-side discovery is built in: addressing a service by name (`http://payment-service`) resolves via CoreDNS to a ClusterIP, and kube-proxy routes the traffic to one of the healthy Endpoints. The infrastructure is completely transparent to the application.

A team wants canary deployments: 10% of traffic to v2, 90% to v1. Which discovery approach fits best?

Questions for Reflection

  • In Kubernetes, service discovery works transparently - the application knows nothing about any registry. What actually happens inside the cluster when a pod crashes and a new one starts up? Which components update the information about available instances?

Related Lessons

  • ds-09-gossip-protocols — Gossip (SWIM) is used for health propagation in discovery
  • dist-09-raft — Consul uses Raft for consistent service registry reads
  • ds-11-distributed-locks — Registry and lock service are both linearisable KV stores
  • dist-12-consistency — Discovery reads need monotonic consistency guarantees
  • net-18-dns-basics