DevOps

ELK Stack and Logging

Slack processes 10 billion events per day. When a service fails at 3am, the on-call engineer opens Kibana and finds the root cause in 2 minutes using trace.id correlation across 40 microservices. Without structured logging and ELK, that same investigation takes 45 minutes of SSH sessions.

**GitHub** uses Elasticsearch for search across 100+ billion lines of code and CI/CD logs - a cluster of 100+ nodes handles thousands of queries per second with ILM for cost control.
**Cloudflare** stores DNS and HTTP logs (2 trillion requests per day) in Elasticsearch with ILM: hot data on NVMe SSD, archive in R2 via searchable snapshots.
**Uber** switched from plain text to structured JSON with ECS fields - incident investigation time dropped from hours to minutes via `trace.id` correlation across 4,000+ microservices.

Elasticsearch

Elasticsearch is a distributed search engine built on Apache Lucene. It stores data in indices, partitioned into shards distributed across nodes. For logs, Index Lifecycle Management (ILM) automatically moves indices through hot-warm-cold tiers.
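The hot-warm-cold progression can be expressed as an ILM policy document. The sketch below builds one possible policy body in Python; the repository name `logs-snapshots`, the rollover thresholds, and every `min_age` value are assumptions chosen for illustration, not recommended settings.

```python
import json

# Hypothetical snapshot repository name; it must already exist in the cluster.
SNAPSHOT_REPO = "logs-snapshots"

# All ages and sizes below are illustrative assumptions.
ilm_policy = {
    "policy": {
        "phases": {
            # Hot: the index is actively written to; roll over daily or at 50 GB per primary shard.
            "hot": {
                "actions": {
                    "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                }
            },
            # Warm: 7 days after rollover, merge segments to reduce overhead.
            "warm": {
                "min_age": "7d",
                "actions": {"forcemerge": {"max_num_segments": 1}},
            },
            # Cold: 30 days after rollover, mount the index as a searchable snapshot.
            "cold": {
                "min_age": "30d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": SNAPSHOT_REPO}
                },
            },
            # Delete: remove the index after one year.
            "delete": {"min_age": "365d", "actions": {"delete": {}}},
        }
    }
}

# This JSON would be sent as the body of: PUT _ilm/policy/<policy-name>
print(json.dumps(ilm_policy, indent=2))
```

Once an index has rolled over, `min_age` is measured from the rollover time, so "7d" here means seven days after the index stopped receiving writes.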

ILM can reduce storage cost by roughly 70-80%: hot data on NVMe SSD, warm data on HDD, cold data as searchable snapshots in S3. Logs in S3 remain searchable, with slower queries, at a fraction of the cost.
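To see where a saving of that size can come from, here is a back-of-the-envelope calculation. Every price is a hypothetical placeholder (not a vendor quote), the 365-day retention and the 30-day warm boundary are assumptions, and replicas and compute cost are ignored.

```python
# Hypothetical storage prices in $ per GB-month; placeholders, not real quotes.
PRICE = {"hot": 0.10, "warm": 0.04, "cold": 0.02}

DAILY_GB = 1000        # 1 TB of logs per day
RETENTION_DAYS = 365   # assumed total retention
HOT_DAYS = 7           # days kept on the hot tier
WARM_DAYS = 23         # assumed: warm until day 30
COLD_DAYS = RETENTION_DAYS - HOT_DAYS - WARM_DAYS


def monthly_cost(days_per_tier):
    """Steady-state monthly storage cost for the given number of days held per tier."""
    return sum(DAILY_GB * days * PRICE[tier] for tier, days in days_per_tier.items())


all_hot = monthly_cost({"hot": RETENTION_DAYS})
tiered = monthly_cost({"hot": HOT_DAYS, "warm": WARM_DAYS, "cold": COLD_DAYS})
savings = 1 - tiered / all_hot

print(f"all hot: ${all_hot:,.0f}/month, tiered: ${tiered:,.0f}/month, savings: {savings:.0%}")
```

With these placeholder prices the tiered layout comes out about 77% cheaper than keeping everything hot. Most of the saving comes from the long cold tail, so the result is sensitive to the retention period and the cold-tier price.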

A cluster receives 1TB of logs per day. After 7 days in hot tier, indices are moved to warm. What is the primary cost benefit of ILM?

Logstash and Fluent Bit

Logstash is a log processing pipeline with three stages: input (Filebeat, Kafka, syslog), filter (grok, mutate, drop), and output (Elasticsearch, S3, Kafka). In Kubernetes, Fluent Bit is preferred as a DaemonSet agent due to its minimal RAM footprint (~10MB vs ~300MB for Logstash).
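The three stages can be mimicked in a few lines of Python to show what each one is responsible for. This is only an analogy, not Logstash configuration: the log line format, the regular expression standing in for a grok pattern, and the function names are all hypothetical.

```python
import json
import re

# Grok-like pattern for an assumed plain-text format:
# "<timestamp> <LEVEL> <service> <message>"
LINE_RE = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<message>.*)"
)


def read_input(lines):
    """Input stage: yield raw lines from a source (here, an in-memory list)."""
    for line in lines:
        yield line.rstrip("\n")


def apply_filters(raw_lines):
    """Filter stage: parse (grok), rename fields (mutate), discard noise (drop)."""
    for raw in raw_lines:
        match = LINE_RE.match(raw)
        if match is None:
            continue  # unparseable line: skipped in this sketch
        event = match.groupdict()
        if event["level"] == "DEBUG":
            continue  # drop: debug noise is not shipped
        # mutate: rename fields to ECS-style names
        event["@timestamp"] = event.pop("timestamp")
        event["log.level"] = event.pop("level").lower()
        event["service.name"] = event.pop("service")
        yield event


def write_output(events):
    """Output stage: serialize events as JSON lines for a downstream sink."""
    return [json.dumps(event, sort_keys=True) for event in events]


sample = [
    "2024-05-01T03:00:00Z ERROR payment-api Card declined",
    "2024-05-01T03:00:01Z DEBUG payment-api cache hit",
    "malformed line",
]
shipped = write_output(apply_filters(read_input(sample)))
for line in shipped:
    print(line)
```

Only the first sample line survives: the DEBUG line is dropped by the filter and the malformed line does not match the pattern.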

Fluent Bit vs Logstash: use Fluent Bit (about 10MB, written in C) for DaemonSet log collection and Logstash (about 300MB, runs on the JVM) for complex transformations such as grok parsing, enrichment, and multi-output routing. A common pattern: Fluent Bit on each node ships logs to Kafka, and Logstash consumes from Kafka.

A Kubernetes cluster runs 200 pods. Each pod writes logs to stdout. What is the correct architecture for collecting these logs?

Kibana

Kibana is the visualization and search layer for Elasticsearch. Key tools: Discover (ad-hoc log search with KQL), Dashboards (real-time metrics), Alerting (trigger PagerDuty on anomalies), Lens (drag-and-drop analytics).

KQL (Kibana Query Language) is field-based: a query such as `response.code >= 500` runs against an indexed numeric field instead of scanning message text. That makes it faster than plain-text search, and it is why structured logging matters.
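A field-based KQL filter corresponds to structured clauses in the Elasticsearch Query DSL. The sketch below builds a roughly equivalent search body in Python. The field `response.code` is taken from the example above and is assumed to be mapped as a number; `service.name`, the 15-minute window, and the helper name `errors_query` are assumptions for illustration.

```python
import json


def errors_query(service, minutes=15):
    """Build a search body for recent 5xx responses from one service (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                # Filter clauses skip relevance scoring and can be cached.
                "filter": [
                    {"term": {"service.name": service}},
                    {"range": {"response.code": {"gte": 500}}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
        "sort": [{"@timestamp": "desc"}],
        "size": 50,
    }


body = errors_query("payment-api")
print(json.dumps(body, indent=2))
```

Each clause targets one indexed field, so Elasticsearch can answer from its per-field index structures rather than matching a pattern against every message string.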

An on-call engineer receives a PagerDuty alert about elevated error rate in payment-api. Which Kibana query finds the root cause fastest?

Structured Logging

Structured logging writes logs in a machine-readable format (JSON) instead of plain text. Every field is separately indexed in Elasticsearch, enabling fast filtering, aggregation, and correlation by `trace.id`.
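A minimal sketch of structured logging with Python's standard `logging` module follows. The `JsonFormatter` class and the `fields` convention for passing extra data are hypothetical choices for this example; production services would more likely use a maintained ECS logging library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with ECS-style field names."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        event = {
            "@timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "log.level": record.levelname.lower(),
            "message": record.getMessage(),
            "service.name": self.service_name,
        }
        # Extra fields arrive via `extra={"fields": {...}}` (a convention assumed here).
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("payment-api"))
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Illustrative values only.
logger.error(
    "payment failed",
    extra={"fields": {"trace.id": "abc123", "user.id": "12345", "event.reason": "insufficient funds"}},
)
```

The message stays short and constant ("payment failed"), while everything that varies goes into separate fields that can be filtered and aggregated.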

ECS (Elastic Common Schema) is the field naming standard. Using ECS across all services enables cross-service correlation: a single query such as `trace.id: abc123` returns the log lines from every service that handled that request.
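Cross-service correlation works only because every service uses the same field name. The sketch below imitates it in memory: the events, service names, and the helper `by_trace` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical events emitted by three services, all using the same ECS-style keys.
logs = [
    {"@timestamp": "2024-05-01T03:00:00.120Z", "service.name": "payment-api",
     "trace.id": "abc123", "message": "charge requested"},
    {"@timestamp": "2024-05-01T03:00:00.050Z", "service.name": "api-gateway",
     "trace.id": "abc123", "message": "POST /checkout"},
    {"@timestamp": "2024-05-01T03:00:00.300Z", "service.name": "ledger",
     "trace.id": "abc123", "message": "insufficient funds"},
    {"@timestamp": "2024-05-01T03:00:00.200Z", "service.name": "api-gateway",
     "trace.id": "zzz999", "message": "GET /health"},
]


def by_trace(events):
    """Group events by trace.id and order each group by timestamp."""
    grouped = defaultdict(list)
    for event in events:
        trace = event.get("trace.id")
        if trace:
            grouped[trace].append(event)
    for trace_events in grouped.values():
        # ISO 8601 timestamps in one format and timezone sort correctly as strings.
        trace_events.sort(key=lambda e: e["@timestamp"])
    return grouped


timeline = by_trace(logs)["abc123"]
for event in timeline:
    print(event["@timestamp"], event["service.name"], event["message"])
```

If one service logged the identifier under a different key, its events would silently fall out of the timeline, which is the practical argument for a shared schema.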

Myth: structured logging is just cosmetics, since plain text logs contain all the same information.

Reality: structured logs are indexed field by field, enabling fast queries and aggregations that are impractical with plain text.

Example: finding all payments over $1000 that failed in the last hour. With structured logging, this is one KQL query that returns in about 50ms. With plain text, it requires a regex scan over the full text, taking 10-60 seconds at scale.
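The difference shows up even on a toy dataset. In the sketch below the field names `event.outcome`, `payment.amount`, and `user.id`, the plain-text line format, and all values are hypothetical.

```python
import re

# Structured events: each value is already typed and addressable by name.
structured = [
    {"event.outcome": "failure", "payment.amount": 1500.0, "user.id": "12345"},
    {"event.outcome": "success", "payment.amount": 2500.0, "user.id": "777"},
    {"event.outcome": "failure", "payment.amount": 40.0, "user.id": "42"},
]

# The same information as plain text in an assumed format.
plain = [
    "[ERROR] User 12345 payment of $1500.00 failed: insufficient funds",
    "[INFO] User 777 payment of $2500.00 succeeded",
    "[ERROR] User 42 payment of $40.00 failed: card expired",
]

# Structured: a direct comparison on two fields.
failed_large = [
    event for event in structured
    if event["event.outcome"] == "failure" and event["payment.amount"] > 1000
]

# Plain text: a regex tied to one exact wording, plus manual type conversion.
PATTERN = re.compile(r"User (\d+) payment of \$([\d.]+) failed")
failed_large_text = [
    match.group(1)
    for line in plain
    if (match := PATTERN.search(line)) and float(match.group(2)) > 1000
]

print([event["user.id"] for event in failed_large], failed_large_text)
```

Both approaches find user 12345 here, but the regex breaks as soon as any service words the message differently, and it has to run over every line rather than use an index.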

A microservice logs: `[ERROR] User 12345 payment failed: insufficient funds`. What is missing compared to structured logging?

Key Ideas

**Elasticsearch + ILM** - stores logs with automatic lifecycle management: SSD for hot data, S3 for archive, auto-delete after one year.
**Logstash / Fluent Bit** - pipeline for collection and transformation; Fluent Bit is preferred as a Kubernetes DaemonSet agent for its minimal RAM footprint.
**Structured logging (JSON + ECS)** - the foundation: every field is indexed, `trace.id` correlation is possible, KQL queries work without grok parsing.

Questions for Reflection

How does ILM help control log storage costs when traffic grows 10x?
If a service logs in plain text, what steps are needed to migrate to structured logging without downtime?
When is Grafana Loki preferable to Elasticsearch for centralized logging?