AI Engineering
Observability: Logging, Tracing, Monitoring AI Pipelines
Lesson Objectives
- Identify what to log for each LLM call: prompts, tokens, cost, quality, feedback
- Implement distributed tracing for LLM chains with parent-child spans
- Integrate an LLM observability platform (Helicone, Langfuse, LangSmith)
- Set up alerting for quality degradation, error rate spikes, and cost anomalies
- Build a cost monitoring dashboard with realtime metrics and budget tracking
Classic backend: logs, metrics, traces - the failure point is obvious. AI backend: model returned 200, tokens are spent, user is unhappy - and it's unclear why. LLM observability is a separate discipline. Standard APM shows HTTP 200 - everything's green. But inside: hallucination, degraded prompt, cost per session tripled. This is the blind spot of standard monitoring.
- Langfuse (open-source, 5000+ GitHub stars) - self-hosted LLM observability with traces/spans for every call, cost per session, hallucination rate, and user feedback loop
- Helicone - proxy-based: change one line (the baseURL), and all prompts, tokens, and costs are logged automatically, with no other code changes
- LangSmith from LangChain: 100K+ active projects, de facto standard for LangChain applications with playground and eval datasets
- OpenAI Dashboard shows usage and costs but not quality - application-level monitoring is required on top
When LLM Apps Needed Their Own Telemetry
Through **2023**, as teams pushed LLM apps into production, classic metrics and logs proved too coarse: a single user request fanned out into chains of prompts, retrievals and tool calls, and a plain error log said nothing about which step misbehaved or why. The fix borrowed an old idea from microservices - **distributed tracing** - and adapted it to LLMs, later standardized through OpenTelemetry conventions for AI spans. **LangSmith**, launched by the LangChain team in beta in July 2023, made traces of every chain step a first-class object. The same year, Y Combinator's W23 batch produced two more tools that defined the category: **Helicone**, a proxy that logged and tracked requests with a one-line base-URL change, and **Langfuse**, an open-source, self-hostable platform for tracing, prompt management and evaluation. Observability for AI stopped being a hand-rolled wrapper and became its own tooling layer.
Prerequisites
What to Log: Prompts, Tokens, Latency, Quality
October 2023. An AI startup discovers response quality dropped 30% over two weeks. Five days of manual log diving. The culprit: one broken few-shot example in the system prompt - OpenAI had updated GPT-4, and the example stopped working. With quality metrics monitoring, an alert would have fired on day one. Instead, 10,000 users got bad answers.
Standard APM (Datadog, New Relic) sees HTTP status, latency, memory. LLM call returned 200 - everything's green. But inside: model is hallucinating, prompt degraded, cost per session tripled. Standard tools are blind to prompt/completion content, token usage, and semantic quality. Application-level monitoring is required.
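The per-call fields this lesson asks for (model, tokens, cost, latency, cache hit, guardrail violations, quality score, user feedback) can be captured in one structured record. The following is a minimal sketch, not any platform's schema: the `LLMCallRecord` class, the `example-model` name, and the prices in `PRICES_PER_1K` are hypothetical placeholders.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional

# Hypothetical per-1K-token prices in USD; replace with your provider's real rates.
PRICES_PER_1K = {"example-model": {"input": 0.01, "output": 0.03}}


@dataclass
class LLMCallRecord:
    """One structured log entry per LLM call (illustrative field names)."""
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    cache_hit: bool = False
    guardrail_violations: int = 0
    quality_score: Optional[float] = None   # filled in later by an eval job
    user_feedback: Optional[str] = None     # e.g. "thumbs_up" / "thumbs_down"
    timestamp: float = field(default_factory=time.time)

    @property
    def cost_usd(self) -> float:
        # Cost = tokens * price per 1K tokens / 1000, summed over input and output.
        price = PRICES_PER_1K[self.model]
        total = (self.prompt_tokens * price["input"]
                 + self.completion_tokens * price["output"])
        return total / 1000

    def to_json(self) -> str:
        # Serialize the stored fields plus the derived cost as one JSON line.
        payload = asdict(self)
        payload["cost_usd"] = round(self.cost_usd, 6)
        return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    record = LLMCallRecord("example-model", prompt_tokens=1000,
                           completion_tokens=500, latency_ms=820.0)
    print(record.to_json())
```

Emitting one JSON line per call keeps the records easy to aggregate later into cost per session or p95 latency by request type.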
**Do not log full prompts and responses without encryption.** Prompts may contain user PII, responses may contain confidential information. Options: hashing, encryption at rest, role-based access. GDPR/CCPA require auditing access to personal data.
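One way to apply the hashing option is sketched below. It also masks obvious identifiers before logging, which is an addition not named in the list above. The regular expressions are deliberately simple assumptions and will miss many PII forms, and the function names and the `secret` key handling are hypothetical.

```python
import hashlib
import hmac
import re

# Simplistic illustrative patterns; real PII detection needs a dedicated tool.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Mask emails and phone-like digit runs before the text reaches a log."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def fingerprint(text: str, secret: bytes) -> str:
    """Keyed SHA-256 hash: lets you group identical prompts without storing them."""
    return hmac.new(secret, text.encode("utf-8"), hashlib.sha256).hexdigest()


def loggable_prompt(prompt: str, secret: bytes) -> dict:
    """Build the fields that are safe to write to the application log."""
    return {
        "prompt_redacted": redact(prompt),
        "prompt_hmac": fingerprint(prompt, secret),
    }


if __name__ == "__main__":
    entry = loggable_prompt(
        "Mail jane.doe@example.com or call +1 415 555 0100 today",
        secret=b"rotate-me",  # hypothetical key; load from a secret store in practice
    )
    print(entry["prompt_redacted"])
```

The keyed hash (HMAC) rather than a bare hash is a design choice: without the key, short or guessable prompts could be recovered by brute force.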
Myth: Standard APM (Datadog, New Relic) covers LLM observability
Reality: APM sees HTTP status, latency, errors - but is blind to prompt/completion content, token usage, semantic quality, and cost per session
LLM call returned 200 - APM is satisfied. But inside: model is hallucinating, prompt degraded after a provider update, cost per session tripled. Hallucination rate, p95 latency by request type, user feedback loop, traces/spans for individual chain steps - all of this requires application-level instrumentation that standard APM tools are not designed to provide.
Why is standard APM (Datadog, New Relic) insufficient for LLM observability?
Distributed Tracing for LLM Chains: From Request to Response
An LLM application is not a single call. It's a chain: guardrails -> RAG retrieval -> cache check -> LLM call -> output validation. When something goes wrong, the goal is knowing exactly which step failed. Without tracing, all you have is a total request time and a final status.
Distributed tracing links all steps into a single trace with parent-child spans. Each span knows its position in the chain, its duration, and its attributes. Langfuse, LangSmith, and Helicone build traces exactly this way - traces/spans for LLM calls are their primary data unit.
Langfuse and LangSmith work the same way under the hood. In the Langfuse SDK, for example, every `trace.span()` call creates a span, and every `trace.generation()` creates a span with LLM-specific attributes (model, tokens, cost). The timeline in the UI is simply a render of this tree.
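The parent-child structure can be illustrated with a small in-memory tracer. This is a sketch of the idea only, not the Langfuse or LangSmith SDK: the `Tracer` and `Span` classes, the step names, and the token counts are hypothetical, and the single stack assumes one request handled on one thread.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]          # None marks the root span
    attributes: Dict[str, object] = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


class Tracer:
    """Collects spans for one trace; nesting `with` blocks sets parent ids."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: List[Span] = []
        self._stack: List[Span] = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name: str, **attributes: object) -> Iterator[Span]:
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(name, self.trace_id, uuid.uuid4().hex[:16],
                       parent_id, dict(attributes))
        current.start = time.perf_counter()
        self._stack.append(current)
        try:
            yield current
        except Exception as exc:
            current.attributes["error"] = repr(exc)  # record which step failed
            raise
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.spans.append(current)


def handle_request(tracer: Tracer) -> None:
    """Placeholder chain: each step would call real code inside its span."""
    with tracer.span("request"):
        with tracer.span("guardrails"):
            pass
        with tracer.span("rag_retrieval", documents=3):
            pass
        with tracer.span("cache_check", hit=False):
            pass
        with tracer.span("llm_call", model="example-model") as llm:
            llm.attributes.update(prompt_tokens=120, completion_tokens=40)
        with tracer.span("output_validation"):
            pass


if __name__ == "__main__":
    tracer = Tracer()
    handle_request(tracer)
    for s in tracer.spans:
        print(f"{s.name:18} parent={s.parent_id} {s.duration_ms:.3f} ms")
```

Because every span carries its own duration and an optional `error` attribute, the failing or slow step can be read straight off the tree instead of inferred from a total request time.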
Why is each step of an LLM chain wrapped in a separate span?
LangSmith, Langfuse, Helicone: Specialized Platforms
Building tracing from scratch is possible. But why, when Langfuse (open-source, 5000+ GitHub stars), Helicone, and LangSmith have already solved this? They provide ready-made infrastructure: automatic trace/span collection for LLM calls, cost per session, p95 latency, hallucination rate, user feedback loop - all out of the box.
| Platform | Type | Integration | Pricing | Key Feature |
|---|---|---|---|---|
| LangSmith | SaaS | LangChain SDK, REST API | Free tier + paid | Deep LangChain integration, playground |
| Langfuse | Open-source / SaaS | SDK (JS/Python), OpenAI proxy | Self-hosted free, SaaS paid | Self-hosted, GDPR-compliant |
| Helicone | Proxy | Change baseURL, no SDK needed | Free tier + paid | Zero-code setup, caching proxy |
| Braintrust | SaaS | SDK, CI/CD integration | Paid | Eval-focused, dataset management |
| Arize Phoenix | Open-source | OpenTelemetry, SDK | Self-hosted free | ML observability background |
Platform selection is an engineering decision, not a preference. **Helicone** - fastest start: proxy, one line, no other code changes. **Langfuse** - open-source, self-hosted, GDPR compliance without data leaving the country. **LangSmith** - for LangChain projects that need playground and eval datasets.
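The proxy approach amounts to pointing the OpenAI client at a different base URL. The sketch below only builds the client arguments so it runs without network access. The endpoint `https://oai.helicone.ai/v1` and the `Helicone-Auth` header are assumptions taken from Helicone's public documentation and should be checked against the current docs; `helicone_client_kwargs` is a hypothetical helper name.

```python
from typing import Any, Dict

# Assumed Helicone proxy endpoint for OpenAI-compatible traffic; verify before use.
HELICONE_OPENAI_BASE_URL = "https://oai.helicone.ai/v1"


def helicone_client_kwargs(openai_key: str, helicone_key: str) -> Dict[str, Any]:
    """Constructor arguments that route OpenAI SDK traffic through the proxy."""
    return {
        "api_key": openai_key,
        "base_url": HELICONE_OPENAI_BASE_URL,  # the "one line" that changes
        "default_headers": {"Helicone-Auth": f"Bearer {helicone_key}"},
    }


# Usage with the openai Python SDK (v1+), assuming both keys are in the environment:
#   import os
#   from openai import OpenAI
#   client = OpenAI(**helicone_client_kwargs(os.environ["OPENAI_API_KEY"],
#                                            os.environ["HELICONE_API_KEY"]))
```

Application code that calls the client stays unchanged; only the client construction differs, which is what makes the proxy route the fastest to adopt.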
What is Helicone's key advantage over LangFuse for getting started quickly?
Alerting on Quality Degradation
A dashboard without alerting is a pretty picture nobody looks at. Model degradation happens at night. Provider updates land on Friday evenings. Cost anomalies accumulate over weekends. Automatic triggers are essential: eval score dropped, error rate spiked, cost per session surged.
Hallucination rate above baseline, thumbs down above 20%, fallback rate creeping up - each signal demands a different response. Not everything warrants a PagerDuty page at 3am. Severity matters: warning to Slack, critical to PagerDuty.
| Alert | Condition | Window | Meaning |
|---|---|---|---|
| Quality degradation | quality_score < 3.5/5 | 30 min | Model is producing poor responses |
| High latency | p95 > 10s | 15 min | API is degrading or overloaded |
| Error rate spike | error_rate > 5% | 10 min | Rate limit, outage, or bug |
| Cost anomaly | > USD 100 per hour | 60 min | DDoS, infinite loop, or spam |
| Fallback rate | > 10% | 30 min | Primary model is unavailable |
| Negative feedback | thumbs_down > 20% | 60 min | Users are dissatisfied |
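The table above maps naturally onto declarative rules. The sketch below assumes the metrics have already been aggregated over each rule's window and that rates are expressed as fractions. The metric key names, the `AlertRule` class, and in particular the severity assigned to each rule are illustrative assumptions, not prescriptions from this lesson.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str                        # key in the aggregated metrics dict
    breached: Callable[[float], bool]  # condition from the table
    window_min: int                    # window the metric was aggregated over
    severity: str                      # assumed: "warning" or "critical"


# Thresholds and windows come from the table; severities are hypothetical.
RULES: List[AlertRule] = [
    AlertRule("Quality degradation", "quality_score", lambda v: v < 3.5, 30, "warning"),
    AlertRule("High latency", "p95_latency_s", lambda v: v > 10, 15, "warning"),
    AlertRule("Error rate spike", "error_rate", lambda v: v > 0.05, 10, "critical"),
    AlertRule("Cost anomaly", "cost_usd_per_hour", lambda v: v > 100, 60, "critical"),
    AlertRule("Fallback rate", "fallback_rate", lambda v: v > 0.10, 30, "warning"),
    AlertRule("Negative feedback", "thumbs_down_rate", lambda v: v > 0.20, 60, "warning"),
]


def evaluate(rules: List[AlertRule], metrics: Dict[str, float]) -> List[AlertRule]:
    """Return the rules whose condition is breached; missing metrics are skipped."""
    return [r for r in rules
            if r.metric in metrics and r.breached(metrics[r.metric])]


def route(rule: AlertRule) -> str:
    """Severity routing as described above: warning to Slack, critical to PagerDuty."""
    return "pagerduty" if rule.severity == "critical" else "slack"


if __name__ == "__main__":
    current = {"quality_score": 3.2, "p95_latency_s": 4.0, "error_rate": 0.08,
               "cost_usd_per_hour": 40.0, "fallback_rate": 0.02,
               "thumbs_down_rate": 0.10}
    for fired in evaluate(RULES, current):
        print(f"{fired.name} -> {route(fired)}")
```

Keeping thresholds in data rather than scattered through code makes it easy to review and tune them as baselines shift.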
Which alert should be critical (PagerDuty) rather than warning (Slack)?
Cost Monitoring Dashboard: Real-Time Spending Visualization
Cost per session, p95 latency, hallucination rate, user feedback loop - all of this needs to be visible in one place. The cost monitoring dashboard combines financial and quality metrics. The key metric is **projected month-end cost**, a forecast based on the current spending pace: the team sees in advance that spending will exceed the budget and has time to optimize.
| Widget | Refresh Rate | Audience |
|---|---|---|
| Realtime spend | Every minute | DevOps, on-call |
| Cost breakdown by model | Every 5 min | Engineering lead |
| Quality trends | Every 15 min | Product manager |
| Budget utilization | Every hour | Management |
| Anomaly detection | Every 10 min | DevOps, on-call |
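The projected month-end cost behind the budget widgets can be computed with a simple linear extrapolation. This is one possible sketch: it assumes spending continues at the average daily pace so far and counts the current day as a full day. The function names and the example figures are hypothetical.

```python
import calendar
from datetime import date
from typing import Dict


def projected_month_end_cost(spend_to_date: float, today: date) -> float:
    """Extrapolate month-to-date spend linearly to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_pace = spend_to_date / today.day   # average spend per elapsed day
    return daily_pace * days_in_month


def budget_status(spend_to_date: float, budget: float, today: date) -> Dict[str, object]:
    """Values a budget-utilization widget might display."""
    projected = projected_month_end_cost(spend_to_date, today)
    return {
        "utilization": spend_to_date / budget,  # share of budget already spent
        "projected": projected,
        "over_budget": projected > budget,      # early warning before month end
    }


if __name__ == "__main__":
    # USD 1,200 spent by June 10 projects to USD 3,600 for the 30-day month.
    print(budget_status(1200.0, budget=3000.0, today=date(2024, 6, 10)))
```

A linear pace is the simplest forecast; traffic with strong weekly seasonality would call for a weighted or per-weekday projection instead.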
Fallback level breakdown is the first widget to check on a quality degradation alert. High fallback rate (>10% on secondary/template) means the primary model is returning errors. Normal fallback rate but low quality means a content problem: provider model update or a broken prompt.
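The triage rule in the paragraph above reduces to two comparisons. The sketch below encodes it with the thresholds used elsewhere in this lesson (10% fallback rate, 3.5/5 quality); the function name and the returned labels are hypothetical.

```python
def triage_quality_alert(fallback_rate: float, quality_score: float,
                         fallback_threshold: float = 0.10,
                         quality_threshold: float = 3.5) -> str:
    """First-pass diagnosis when a quality degradation alert fires."""
    if fallback_rate > fallback_threshold:
        # Many requests served by secondary/template: primary model is erroring.
        return "primary_model_errors"
    if quality_score < quality_threshold:
        # Primary model answers, but badly: provider update or broken prompt.
        return "content_problem"
    return "no_action"


if __name__ == "__main__":
    print(triage_quality_alert(fallback_rate=0.25, quality_score=3.0))
    print(triage_quality_alert(fallback_rate=0.02, quality_score=3.0))
```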
Which dashboard metric should be checked FIRST when a quality degradation alert fires?
Observability for AI Pipelines
- APM sees HTTP metrics - misses prompt/completion content, token usage, semantic quality. Application-level monitoring is required
- For each LLM call, log: model, tokens, cost, latency, cache hit, guardrail violations, quality score, user feedback
- Distributed tracing: each chain step in a separate span. Timeline visualization instantly shows the bottleneck
- Platforms: Helicone (proxy, one line) to start, Langfuse (open-source) for compliance, LangSmith for LangChain
- Alerting: quality < 3.5/5, error rate > 5%, cost > USD 100 per hour, fallback rate > 10%, thumbs down > 20%
- Dashboard: realtime spend, breakdown by model, budget utilization, projected month-end cost, quality trends
What's Next
Observability completes the Production & Optimization block. Next topics cover fine-tuning and working with open-source models.
- Fine-tuning — Observability data (prompts, responses, feedback) serves as source data for fine-tuning
- Evaluation — Eval scores are the key quality monitoring metric in observability
- Cost Management — Cost monitoring dashboard uses data from the cost management service
Related Lessons
- aie-20-langchain-llamaindex — Frameworks emit the traces we observe
- aie-31-evaluation — Traces become datasets for evaluation
- aie-29-cost-management — Spans expose per-step token cost
- aie-36-fine-tuning — Logged interactions feed fine-tuning data
- sd-22-observability — Same logs, metrics, traces triad
- sd-10-microservices