AI Engineering

Observability: Logging, Tracing, Monitoring AI Pipelines

Lesson objectives

Identify what to log for each LLM call: prompts, tokens, cost, quality, feedback
Implement distributed tracing for LLM chains with parent-child spans
Integrate an LLM observability platform (Helicone, Langfuse, LangSmith)
Set up alerting for quality degradation, error rate spikes, and cost anomalies
Build a cost monitoring dashboard with realtime metrics and budget tracking

In a classic backend, logs, metrics, and traces make the failure point obvious. In an AI backend, the model returns 200, tokens are spent, the user is unhappy, and it is unclear why. That makes LLM observability a separate discipline: standard APM shows HTTP 200 and everything looks green, while inside the response there is a hallucination, a degraded prompt, or a cost per session that has tripled. This is the blind spot of standard monitoring.

Langfuse (open-source, 5000+ GitHub stars) - self-hosted LLM observability with traces/spans for every call, cost per session, hallucination rate, and user feedback loop
Helicone - proxy-based: change one line (the baseURL) and all prompts, tokens, and costs are logged automatically, with no further instrumentation code
LangSmith from LangChain: 100K+ active projects, de facto standard for LangChain applications with playground and eval datasets
OpenAI Dashboard shows usage and costs but not quality - application-level monitoring is required on top

When LLM Apps Needed Their Own Telemetry

Through **2023**, as teams pushed LLM apps into production, classic metrics and logs proved too coarse: a single user request fanned out into chains of prompts, retrievals and tool calls, and a plain error log said nothing about which step misbehaved or why. The fix borrowed an old idea from microservices - **distributed tracing** - and adapted it to LLMs, later standardized through OpenTelemetry conventions for AI spans. **LangSmith**, launched by the LangChain team in beta in August 2023, made traces of every chain step a first-class object. The same year, Y Combinator's W23 batch produced two more tools that defined the category: **Helicone**, a proxy that logged and tracked requests with a one-line base-URL change, and **Langfuse**, an open-source, self-hostable platform for tracing, prompt management and evaluation. Observability for AI stopped being a hand-rolled wrapper and became its own tooling layer.

Prerequisites

LangChain and LlamaIndex: Orchestrating LLM Pipelines

What to Log: Prompts, Tokens, Latency, Quality

October 2023. An AI startup discovers response quality dropped 30% over two weeks. Five days of manual log diving. The culprit: one broken few-shot example in the system prompt - OpenAI had updated GPT-4, and the example stopped working. With quality metrics monitoring, an alert would have fired on day one. Instead, 10,000 users got bad answers.

Standard APM (Datadog, New Relic) sees HTTP status, latency, memory. LLM call returned 200 - everything's green. But inside: model is hallucinating, prompt degraded, cost per session tripled. Standard tools are blind to prompt/completion content, token usage, and semantic quality. Application-level monitoring is required.
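The fields that APM misses can be captured in one structured record per LLM call. The sketch below is a minimal illustration, not a prescribed schema: the `LLMCallRecord` class, the `example-model` name, and the prices in `PRICES_PER_MILLION` are all hypothetical placeholders. It deliberately stores only metadata, not prompt or completion text.

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical prices in USD per 1M tokens; replace with your provider's current rates.
PRICES_PER_MILLION = {
    "example-model": {"input": 2.50, "output": 10.00},
}


@dataclass
class LLMCallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cache_hit: bool = False
    quality_score: Optional[float] = None  # e.g. an eval score on a 1-5 scale
    user_feedback: Optional[str] = None    # e.g. "thumbs_up" or "thumbs_down"

    @property
    def cost_usd(self) -> float:
        # Cost is derived from token counts so it cannot drift from them.
        price = PRICES_PER_MILLION[self.model]
        return (self.input_tokens * price["input"]
                + self.output_tokens * price["output"]) / 1_000_000

    def to_json(self) -> str:
        payload = asdict(self)
        payload["cost_usd"] = round(self.cost_usd, 6)
        payload["logged_at"] = time.time()
        return json.dumps(payload)


record = LLMCallRecord("example-model", input_tokens=1200, output_tokens=300, latency_ms=840.0)
print(record.to_json())
```

Quality score and user feedback usually arrive after the call finishes, which is why they default to `None` and are filled in later.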

**Do not log full prompts and responses without encryption.** Prompts may contain user PII, responses may contain confidential information. Options: hashing, encryption at rest, role-based access. GDPR/CCPA require auditing access to personal data.
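One way to apply the hashing option is to log a redacted copy of the prompt plus a keyed fingerprint of the original. This is a rough sketch under stated assumptions: the two regular expressions are simplistic stand-ins for real PII detection, and `redact`, `fingerprint`, and the key value are hypothetical names chosen for illustration.

```python
import hashlib
import hmac
import re

# Assumption: simplistic patterns for illustration; real PII detection needs more than regexes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{8,}\d")


def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def fingerprint(text: str, secret_key: bytes) -> str:
    """Keyed SHA-256 hash: groups identical prompts without storing their content."""
    return hmac.new(secret_key, text.encode("utf-8"), hashlib.sha256).hexdigest()


prompt = "Contact me at jane.doe@example.com or +1 415 555 0100 about my order."
safe_entry = {
    "prompt_redacted": redact(prompt),
    "prompt_fingerprint": fingerprint(prompt, secret_key=b"hypothetical-key"),
}
print(safe_entry["prompt_redacted"])
# prints: Contact me at [EMAIL] or [PHONE] about my order.
```

The fingerprint uses a secret key so that short or guessable prompts cannot be recovered by hashing candidate strings. Redaction and hashing complement encryption at rest and access control; they do not replace them.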

Myth: standard APM (Datadog, New Relic) covers LLM observability

Reality: APM sees HTTP status, latency, and errors, but is blind to prompt/completion content, token usage, semantic quality, and cost per session

LLM call returned 200 - APM is satisfied. But inside: model is hallucinating, prompt degraded after a provider update, cost per session tripled. Hallucination rate, p95 latency by request type, user feedback loop, traces/spans for individual chain steps - all of this requires application-level instrumentation that standard APM tools are not designed to provide.

Why is standard APM (Datadog, New Relic) insufficient for LLM observability?

Distributed Tracing for LLM Chains: From Request to Response

An LLM application is not a single call. It's a chain: guardrails -> RAG retrieval -> cache check -> LLM call -> output validation. When something goes wrong, the goal is knowing exactly which step failed. Without tracing - only a total request time and a final status.

Distributed tracing links all steps into a single trace with parent-child spans. Each span knows its position in the chain, its duration, and its attributes. Langfuse, LangSmith, and Helicone build traces exactly this way - traces/spans for LLM calls are their primary data unit.

Langfuse and LangSmith follow this model under the hood. In Langfuse's low-level SDK, for example, every `trace.span()` call creates a span, and every `trace.generation()` call creates a span with LLM-specific attributes (model, tokens, cost). The timeline in the UI is simply a render of this tree.
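To make the span tree concrete, here is a minimal hand-rolled tracer built only on the standard library. It is a teaching sketch, not the implementation of any of the platforms named above: the `Span` class, the `span()` context manager, and the step names are hypothetical.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: Optional[str]
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    attributes: dict = field(default_factory=dict)
    duration_ms: float = 0.0


# Tracks the span that is currently open in this execution context.
_current = ContextVar("current_span", default=None)
finished = []  # stands in for an exporter that would ship spans to a backend


@contextmanager
def span(name: str, **attributes):
    parent = _current.get()
    s = Span(
        name=name,
        # A child inherits the trace id; a root span starts a new trace.
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        parent_id=parent.span_id if parent else None,
        attributes=dict(attributes),
    )
    token = _current.set(s)
    start = time.perf_counter()
    try:
        yield s
    finally:
        s.duration_ms = (time.perf_counter() - start) * 1000
        _current.reset(token)
        finished.append(s)


with span("handle_request") as root:
    with span("guardrails"):
        pass
    with span("rag_retrieval", top_k=5):
        pass
    with span("llm_call", model="example-model") as gen:
        gen.attributes.update(input_tokens=1200, output_tokens=300)

for s in finished:
    print(s.name, "parent:", s.parent_id, f"{s.duration_ms:.2f} ms")
```

Because each step records its own duration and parent, a slow or failing step can be located without guessing from the total request time.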

Why is each step of an LLM chain wrapped in a separate span?

LangSmith, Langfuse, Helicone: Specialized Platforms

Building tracing from scratch is possible, but Langfuse (open-source, 5000+ GitHub stars), Helicone, and LangSmith have already solved the problem. They provide ready-made infrastructure: automatic trace/span collection for LLM calls, cost per session, p95 latency, hallucination rate, and a user feedback loop, all out of the box.

Platform	Type	Integration	Pricing	Key Feature
LangSmith	SaaS	LangChain SDK, REST API	Free tier + paid	Deep LangChain integration, playground
Langfuse	Open-source / SaaS	SDK (JS/Python), OpenAI proxy	Self-hosted free, SaaS paid	Self-hosted, GDPR-compliant
Helicone	Proxy	Change baseURL, no SDK needed	Free tier + paid	Zero-code setup, caching proxy
Braintrust	SaaS	SDK, CI/CD integration	Paid	Eval-focused, dataset management
Arize Phoenix	Open-source	OpenTelemetry, SDK	Self-hosted free	ML observability background

Platform selection is an engineering decision, not a preference. **Helicone** - fastest start: a proxy and a one-line base-URL change. **Langfuse** - open-source and self-hosted, so for GDPR compliance the data never has to leave your own infrastructure. **LangSmith** - for LangChain projects that need playground and eval datasets.
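The proxy approach can be sketched as a small helper that builds the constructor arguments for the OpenAI Python client. The endpoint and the `Helicone-Auth` header below reflect Helicone's documented OpenAI integration as far as I am aware; treat both as assumptions and verify them against the current Helicone docs. The helper name and the placeholder keys are hypothetical.

```python
import os


def helicone_client_kwargs(openai_api_key: str, helicone_api_key: str) -> dict:
    """Keyword arguments for openai.OpenAI(...) that route traffic through the proxy."""
    return {
        "api_key": openai_api_key,
        # The base-URL swap (assumed endpoint; check Helicone's documentation).
        "base_url": "https://oai.helicone.ai/v1",
        # The proxy hop is authenticated with a separate header (assumed name).
        "default_headers": {"Helicone-Auth": f"Bearer {helicone_api_key}"},
    }


kwargs = helicone_client_kwargs(
    openai_api_key=os.environ.get("OPENAI_API_KEY", "sk-placeholder"),
    helicone_api_key=os.environ.get("HELICONE_API_KEY", "hc-placeholder"),
)
# With the openai package installed: client = openai.OpenAI(**kwargs)
print(kwargs["base_url"])
```

The application code that makes the calls stays unchanged, which is what makes this the fastest way to start. The trade-off is that a proxy only sees individual requests, so step-level structure still needs an SDK or explicit trace metadata.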

What is Helicone's key advantage over Langfuse for getting started quickly?

Alerting on Quality Degradation

A dashboard without alerting is a pretty picture nobody looks at. Model degradation happens at night. Provider updates land on Friday evenings. Cost anomalies accumulate over weekends. Automatic triggers are essential: eval score dropped, error rate spiked, cost per session surged.

Hallucination rate above baseline, thumbs down above 20%, fallback rate creeping up - each signal demands a different response. Not everything warrants a PagerDuty page at 3am. Severity matters: warning to Slack, critical to PagerDuty.

Alert	Condition	Window	Meaning
Quality degradation	quality_score < 3.5/5	30 min	Model is producing poor responses
High latency	p95 > 10s	15 min	API is degrading or overloaded
Error rate spike	error_rate > 5%	10 min	Rate limit, outage, or bug
Cost anomaly	cost > USD 100 per hour	60 min	DDoS, infinite loop, or spam
Fallback rate	fallback_rate > 10%	30 min	Primary model is unavailable
Negative feedback	thumbs_down > 20%	60 min	Users are dissatisfied
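The rules in the table can be expressed as data and evaluated against a trailing window of samples. The sketch below assumes one sample per minute and fires only when every sample in the window breaches the condition. The `AlertRule` class, the `fires` function, the synthetic data, and the severity assignments are illustrative assumptions, not recommendations from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AlertRule:
    name: str
    breached: Callable[[float], bool]
    window_min: int
    severity: str  # assumed routing: "warning" -> Slack, "critical" -> PagerDuty


def fires(rule: AlertRule, samples: List[Tuple[int, float]], now_min: int) -> bool:
    """True if every sample in the trailing window breaches the condition."""
    recent = [v for t, v in samples if now_min - rule.window_min < t <= now_min]
    # An empty window never fires: missing data is a separate problem.
    return bool(recent) and all(rule.breached(v) for v in recent)


RULES = [
    AlertRule("quality_degradation", lambda v: v < 3.5, 30, "warning"),
    AlertRule("error_rate_spike", lambda v: v > 0.05, 10, "critical"),
]

# Synthetic data: one (minute, value) sample per minute for 30 minutes.
quality = [(m, 3.2) for m in range(1, 31)]
errors = [(m, 0.02) for m in range(1, 31)]
metrics = {"quality_degradation": quality, "error_rate_spike": errors}

for rule in RULES:
    if fires(rule, metrics[rule.name], now_min=30):
        print(f"[{rule.severity}] {rule.name}")
# prints: [warning] quality_degradation
```

Requiring the whole window to breach filters out single noisy samples at the price of slower detection. A production system might instead alert on the window average or on a fraction of breaching samples.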

Which alert should be critical (PagerDuty) rather than warning (Slack)?

Cost Monitoring Dashboard: Real-Time Spending Visualization

Cost per session, p95 latency, hallucination rate, and the user feedback loop all need to be visible in one place. The cost monitoring dashboard combines financial and quality metrics. The key metric is **projected month-end cost**: a forecast based on the current pace of spending. The team sees in advance that spending will exceed the budget and has time to optimize.
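The simplest version of that forecast is a linear run rate: average daily spend so far, multiplied by the number of days in the month. The function name, budget, and figures below are hypothetical, and the sketch assumes the current day counts as a full day of spend.

```python
import calendar
from datetime import date


def projected_month_end_cost(spend_to_date: float, today: date) -> float:
    """Linear run-rate forecast: average daily spend so far times days in the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


budget = 3000.0
forecast = projected_month_end_cost(1200.0, date(2024, 4, 10))
print(f"forecast: USD {forecast:.2f}, over budget: {forecast > budget}")
# prints: forecast: USD 3600.00, over budget: True
```

Here USD 1,200 spent in the first 10 days of a 30-day month projects to USD 3,600, which is already over a USD 3,000 budget with 20 days left to react. A linear forecast ignores weekly seasonality and growth, so it is a starting point rather than a precise estimate.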

Widget	Refresh Rate	Audience
Realtime spend	Every minute	DevOps, on-call
Cost breakdown by model	Every 5 min	Engineering lead
Quality trends	Every 15 min	Product manager
Budget utilization	Every hour	Management
Anomaly detection	Every 10 min	DevOps, on-call

Fallback level breakdown is the first widget to check on a quality degradation alert. A high fallback rate (more than 10% of requests served by the secondary model or a template response) means the primary model is returning errors. A normal fallback rate combined with low quality points to a content problem: a provider model update or a broken prompt.
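That triage rule is a two-way branch and can be written down directly. The function below is a hypothetical sketch: the name and the returned messages are illustrative, and the default thresholds are taken from the alert values used in this lesson.

```python
def triage_quality_alert(fallback_rate: float, quality_score: float,
                         fallback_threshold: float = 0.10,
                         quality_threshold: float = 3.5) -> str:
    """First-pass diagnosis for a quality alert based on the fallback rate."""
    if quality_score >= quality_threshold:
        return "no quality problem"
    if fallback_rate > fallback_threshold:
        # Low quality because degraded fallback paths are serving the traffic.
        return "availability: primary model is failing over, check provider errors"
    # The primary model is serving traffic, so the content itself has regressed.
    return "content: check for a provider model update or a broken prompt"


print(triage_quality_alert(fallback_rate=0.25, quality_score=3.1))
print(triage_quality_alert(fallback_rate=0.02, quality_score=3.1))
```

Checking the fallback rate first separates availability problems from content problems, which call for different owners and different fixes.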

Which dashboard metric should be checked FIRST when a quality degradation alert fires?

Observability for AI Pipelines

APM sees HTTP metrics - misses prompt/completion content, token usage, semantic quality. Application-level monitoring is required
For each LLM call, log: model, tokens, cost, latency, cache hit, guardrail violations, quality score, user feedback
Distributed tracing: each chain step in a separate span. Timeline visualization instantly shows the bottleneck
Platforms: Helicone (proxy, one line) to start, Langfuse (open-source) for compliance, LangSmith for LangChain
Alerting: quality < 3.5/5, error rate > 5%, cost > USD 100 per hour, fallback rate > 10%, thumbs down > 20%
Dashboard: realtime spend, breakdown by model, budget utilization, projected month-end cost, quality trends

What's Next

Observability completes the Production & Optimization block. Next topics cover fine-tuning and working with open-source models.

Fine-tuning — Observability data (prompts, responses, feedback) serves as source data for fine-tuning
Evaluation — Eval scores are the key quality monitoring metric in observability
Cost Management — Cost monitoring dashboard uses data from the cost management service

Related lessons

aie-20-langchain-llamaindex — Frameworks emit the traces we observe
aie-31-evaluation — Traces become datasets for evaluation
aie-29-cost-management — Spans expose per-step token cost
aie-36-fine-tuning — Logged interactions feed fine-tuning data
sd-22-observability — Same logs, metrics, traces triad
sd-10-microservices
