LLM Observability & Tracing: Seeing Inside the Black Box

A user reports your AI assistant gave a completely wrong answer about their subscription plan. You check the logs: “200 OK, response generated, 1.2s latency.” Everything looks fine. But you have no idea what prompt was sent, what context was retrieved, what the model actually generated, or why it was wrong. Traditional application monitoring is blind to AI-specific failures.

LLM observability captures the full trace: user query → prompt assembly → retrieval results → model input → model output → post-processing → final response. When something goes wrong, you can pinpoint: was it bad retrieval? Wrong context? Model hallucination? Prompt engineering issue? Without this visibility, debugging AI applications is guessing in the dark.

What LLM observability captures

graph TD
  subgraph trace["Full LLM Trace"]
      T1["User Input<br/>'What plan am I on?'"]
      T2["Retrieval<br/>Query: embed → search<br/>Results: 3 chunks retrieved"]
      T3["Prompt Assembly<br/>System + context + query<br/>Total: 2,400 tokens"]
      T4["Model Call<br/>Model: gpt-4o, temp: 0.3<br/>Latency: 890ms, tokens: 342 out"]
      T5["Post-Processing<br/>Guardrails: pass<br/>Formatting: applied"]
      T6["Response<br/>'You are on the Pro plan...'"]
  end

  T1 --> T2 --> T3 --> T4 --> T5 --> T6

  style T1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style T2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style T4 fill:#FAEEDA,stroke:#854F0B,color:#633806
  style T6 fill:#EEEDFE,stroke:#534AB7,color:#3C3489

Key metrics to track

Operational metrics

Latency (TTFT, total generation time, per-component breakdown)
Token usage (input/output per request, daily totals, per-feature)
Error rates (API failures, timeouts, rate limits)
Cost (per-request, per-user, per-feature)
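As a minimal sketch of how these operational metrics might be recorded and rolled up per feature: the names `RequestMetrics`, `PRICE_PER_1K`, and `cost_by_feature` are hypothetical, and the prices are placeholders rather than real provider rates.

```python
from dataclasses import dataclass

# Placeholder prices in USD per 1,000 tokens (assumption, not real rates).
PRICE_PER_1K = {"example-model": {"input": 0.005, "output": 0.015}}


@dataclass
class RequestMetrics:
    model: str
    feature: str
    input_tokens: int
    output_tokens: int
    ttft_ms: float   # time to first token
    total_ms: float  # total generation time

    @property
    def cost_usd(self) -> float:
        """Cost of this request from its token counts and the price table."""
        price = PRICE_PER_1K[self.model]
        return (self.input_tokens * price["input"]
                + self.output_tokens * price["output"]) / 1000


def cost_by_feature(requests: list[RequestMetrics]) -> dict[str, float]:
    """Sum request cost per product feature."""
    totals: dict[str, float] = {}
    for r in requests:
        totals[r.feature] = totals.get(r.feature, 0.0) + r.cost_usd
    return totals
```

The same grouping works per user or per day by swapping the key used in `totals`.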

Quality metrics

Retrieval precision (are retrieved chunks relevant?)
Hallucination rate (claims not supported by context)
User satisfaction (thumbs up/down, task completion)
Instruction compliance (does output match requested format?)
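Two of these quality metrics can be computed mechanically once you have labels. The sketch below assumes a reviewer has marked which chunk IDs are relevant, and that the requested output format is a JSON object; `precision_at_k` and `is_valid_json_object` are illustrative helper names, not part of any library.

```python
import json


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for cid in top_k if cid in relevant_ids) / len(top_k)


def is_valid_json_object(output: str, required_keys: set[str]) -> bool:
    """Format check: output parses as a JSON object containing the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())
```

Hallucination rate and user satisfaction usually need a human or model-based judge, so they are not covered by this sketch.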

Trace-level data

Full prompt (system + user + context)
Retrieved documents with relevance scores
Model response (raw and processed)
Tool calls and results
Latency per component
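One way to hold this trace-level data is a small record type that serializes to JSON. This is a sketch only: `TraceRecord` and `Span` are hypothetical names, and a real tracing tool will have its own schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Span:
    name: str          # e.g. "retrieval", "model_call"
    latency_ms: float
    data: dict = field(default_factory=dict)  # prompt, documents, tool calls, ...


@dataclass
class TraceRecord:
    user_id: str
    query: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    spans: list[Span] = field(default_factory=list)

    def add_span(self, name: str, latency_ms: float, **data) -> None:
        """Append one pipeline step with its latency and captured payload."""
        self.spans.append(Span(name, latency_ms, data))

    def to_json(self) -> str:
        """Serialize the whole trace, including nested spans."""
        return json.dumps(asdict(self))
```

Usage: call `add_span("retrieval", 42.0, documents=[...], scores=[...])` after each step, then write `to_json()` to your trace store.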

Implementation

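from langsmith import trace, traceable

# retrieve_context, assemble_prompt, count_tokens, apply_guardrails, llm, and
# system_prompt are application-defined and not shown here.

@traceable(name="customer_support_query")
async def handle_query(user_id: str, query: str) -> str:
    # @traceable records the top-level run; each `trace` block adds a child span.
    with trace("retrieval", inputs={"query": query}) as span:
        chunks = await retrieve_context(query)
        span.end(outputs={"chunks": len(chunks)})

    with trace("prompt_assembly", inputs={"chunks": len(chunks)}) as span:
        prompt = assemble_prompt(system_prompt, chunks, query)
        span.end(outputs={"token_count": count_tokens(prompt)})

    with trace("model_call", inputs={"prompt": prompt}, metadata={"model": "gpt-4o"}) as span:
        response = await llm.generate(prompt)
        span.end(outputs={"tokens_out": response.usage.completion_tokens})

    with trace("post_processing", inputs={"raw": response.content}) as span:
        final = apply_guardrails(response.content)
        span.end(outputs={"final": final})

    return final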

The debugging workflow

When a user reports a bad answer:

Find the trace by user ID, timestamp, or request ID
Check retrieval - were the right documents found? If not, suspect a chunking or embedding issue
Check prompt - was context properly included? Were instructions clear?
Check model output - did the model follow instructions? Did it hallucinate?
Check post-processing - did guardrails modify a correct answer? Did formatting break?

This narrows the bug from “AI gave wrong answer” to “retrieval returned outdated document from 2022 instead of current version” - actionable and fixable.
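The checklist can be sketched as a function that walks a stored trace in the same order and reports the first stage that looks wrong. Everything here is an assumption for illustration: the trace is a plain dict, the field names are hypothetical, and a reviewer has supplied `expected_doc_ids` and `expected_answer`. Substring checks are a crude stand-in for real review.

```python
def triage(trace: dict) -> str:
    """Return the first pipeline stage that looks wrong, in checklist order."""
    # 1. Retrieval: did any of the expected documents come back?
    if not set(trace["retrieved_doc_ids"]) & set(trace["expected_doc_ids"]):
        return "retrieval"

    # 2. Prompt: was every retrieved chunk actually included in the prompt?
    if not all(chunk in trace["prompt"] for chunk in trace["retrieved_texts"]):
        return "prompt_assembly"

    expected = trace["expected_answer"].lower()

    # 3. Model output: does the raw response contain the expected answer?
    if expected not in trace["raw_output"].lower():
        return "model_output"

    # 4. Post-processing: did guardrails or formatting drop a correct answer?
    if expected not in trace["final_output"].lower():
        return "post_processing"

    return "no_fault_found"
```

Running this over a batch of low-rated traces gives a rough count of failures per stage, which shows where to look first.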

Observability tools

LangSmith - first-party LangChain observability, traces through chains/agents
Braintrust - logging, evals, and prompt management with production tracing
Helicone - proxy-based observability (drop-in: point your API base URL at the proxy, no instrumentation code)
Arize Phoenix - open-source tracing with embedding drift detection
OpenTelemetry + custom spans - DIY approach using standard observability infra
Langfuse - open-source alternative to LangSmith with session tracking

How to apply in practice

Instrument from day one. Adding observability after production issues is reactive. Build it in during development so you have baseline data before problems occur.

Sample in production, capture everything in development. 100% tracing in production is expensive. Sample 10-20% of production traffic for ongoing monitoring, but trace 100% during development and when investigating issues.
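A minimal sketch of deterministic sampling, assuming each request carries a stable ID. Hashing the ID means the same request always gets the same decision, so every service in the pipeline agrees on whether to trace it. The `should_trace` name and its `debug` flag are hypothetical; always tracing errors is an extra assumption not stated above.

```python
import hashlib


def should_trace(request_id: str, sample_rate: float, *,
                 is_error: bool = False, debug: bool = False) -> bool:
    """Decide whether to record a full trace for this request."""
    # Assumption: failed requests and active investigations are always traced.
    if is_error or debug:
        return True
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets and keep the lowest share of them.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(sample_rate * 10_000)
```

In development, pass `sample_rate=1.0`; in production, something like `0.1` to `0.2` matches the guidance above.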

Alert on quality drift. Track rolling averages of user satisfaction, hallucination rate, and retrieval precision. Alert when any of them deviates from its baseline by more than 10%.
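A rolling-average drift check might look like the sketch below. `DriftMonitor` is a hypothetical class; it assumes a non-zero baseline measured beforehand and treats the 10% threshold as a relative deviation.

```python
from collections import deque


class DriftMonitor:
    """Flags when the rolling mean of a metric drifts from its baseline."""

    def __init__(self, baseline: float, window: int = 100, threshold: float = 0.10):
        self.baseline = baseline      # must be non-zero
        self.threshold = threshold    # relative deviation, 0.10 = 10%
        self.values = deque(maxlen=window)

    def record(self, value: float) -> bool:
        """Add one observation; return True if an alert should fire."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # wait for a full window before judging
        rolling = sum(self.values) / len(self.values)
        return abs(rolling - self.baseline) / abs(self.baseline) > self.threshold
```

One monitor per metric keeps the baselines independent, for example one for satisfaction rate and another for retrieval precision.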

Build feedback loops. Connect user thumbs-down signals to the specific trace. Review low-rated traces weekly to identify systematic issues.
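The key to the feedback loop is storing the trace ID alongside each rating. A minimal in-memory sketch, with hypothetical `Feedback` and `FeedbackStore` names (a real system would persist this next to the traces):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Feedback:
    trace_id: str      # the trace that produced the rated response
    thumbs_up: bool
    comment: str = ""


class FeedbackStore:
    def __init__(self) -> None:
        self._by_trace: dict[str, list[Feedback]] = defaultdict(list)

    def record(self, feedback: Feedback) -> None:
        """Attach a user rating to the trace it refers to."""
        self._by_trace[feedback.trace_id].append(feedback)

    def low_rated_trace_ids(self) -> list[str]:
        """Trace IDs with at least one thumbs-down, for the weekly review."""
        return sorted(
            trace_id
            for trace_id, items in self._by_trace.items()
            if any(not item.thumbs_up for item in items)
        )
```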

FAQ

Q: Isn't logging full prompts a privacy concern?

Yes. Prompts may contain user PII, business data, or sensitive context. Implement: PII redaction in traces, access controls on trace data, retention policies (delete traces after 30-90 days), and encryption at rest. Never log traces to shared systems without privacy review.

Q: How much does observability add to latency and cost?

Usually minimal. Recording metadata and spans typically adds <5ms per request, provided traces are exported asynchronously rather than on the request path. The storage cost for traces is typically 1-5% of your LLM API spend. The debugging time saved on a single incident can pay for months of observability costs.

Interview questions

Q: Your LLM application’s quality suddenly degrades - user satisfaction drops 20% over a week. How do you use observability to diagnose the root cause?

Systematic diagnosis: (1) Check operational metrics first - any provider errors, latency spikes, or rate limiting? (2) Compare traces from the good period vs the bad period. Look for: a different model version (providers sometimes update models silently), changed retrieval quality (new documents indexed poorly?), prompt changes (was the system prompt modified?). (3) Sample 20 low-rated traces and categorize the failures: hallucinations? Wrong retrieval? Format issues? (4) If retrieval degraded: check the embedding model, vector DB health, and index staleness. If model output degraded: check whether the provider changed the model version. If it is a prompt issue: check version control for system prompt changes. The observability data narrows an ambiguous “quality dropped” to a specific, fixable cause.
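Step (2) of that answer, comparing the good and bad periods, can be sketched as a distribution comparison over any trace field. This assumes traces are dicts and that fields such as `model_version` or a reviewer-assigned `failure_category` are recorded; both field names and `distribution_shift` are hypothetical.

```python
from collections import Counter


def distribution_shift(good: list[dict], bad: list[dict], key: str) -> dict[str, float]:
    """Change in the share of each value of `key` from the good to the bad period."""

    def shares(traces: list[dict]) -> dict[str, float]:
        counts = Counter(t[key] for t in traces)
        total = sum(counts.values())
        return {value: n / total for value, n in counts.items()}

    g, b = shares(good), shares(bad)
    # Positive numbers mean the value became more common in the bad period.
    return {value: b.get(value, 0.0) - g.get(value, 0.0)
            for value in sorted(set(g) | set(b))}
```

A large positive shift for a new model version or a new prompt hash points at what changed between the two periods.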