

OpenTelemetry for AI Applications: Observability When Your Stack Thinks for Itself

Traditional monitoring tells you a request took 800ms. It doesn't tell you the LLM spent 600ms on a bad prompt, returned a hallucinated answer, and burned $0.04 in tokens. Here's how to actually instrument AI applications with OpenTelemetry.

Anurag Verma


7 min read


You deploy a web app. Something gets slow. You open your traces, find the slow database query, add an index, and ship the fix. This workflow is so well-established that most teams do it without thinking.

Now you deploy an AI-powered app. Something goes wrong. A user gets a bizarre response. Your cost per request doubled overnight. A feature that worked last week started producing outputs that don’t make sense. You open your traces.

Your traces show a single HTTP call to api.openai.com that returned 200. That’s it. The black box stayed black.

This is the observability gap most teams discover the hard way after they move AI from demo to production.

What Changes With AI in the Stack

Traditional web services have a comforting property: given the same inputs, they produce the same outputs. Debugging is mostly about finding which input triggered which code path. Logs, metrics, and traces work well for this.

LLMs are different in three ways that break standard observability patterns:

Non-determinism. The same prompt can produce different outputs across calls. A failing user session might not be reproducible, even if you replay the exact request.

Prompt dependency. The application’s behavior depends heavily on the prompt, not just the code. Two apps with identical code but different prompts behave completely differently. Most observability tools treat the prompt as an opaque string in the request body.

Cost as a first-class signal. Token usage is both a correctness signal and a cost signal. A request that uses 10x more tokens than expected is often a sign of something going wrong — a runaway loop, an oversized context, or a prompt that’s generating excessive output.

OpenTelemetry’s Role

OpenTelemetry (OTel) is the CNCF standard for collecting telemetry data — traces, metrics, and logs — from your applications. It’s vendor-neutral, so the data you collect can go to Grafana, Jaeger, Honeycomb, Datadog, or any other backend.

What makes it useful for AI applications is the ability to create custom spans and attributes. A standard HTTP trace tells you the call succeeded in 800ms. A custom OTel span tells you: call succeeded in 800ms, prompt was 1,240 tokens, completion was 340 tokens, model was gpt-4o, cost was $0.021, and the response included a tool call to search_database.

The difference between those two data points is the difference between knowing a request happened and understanding what your application actually did.

Instrumentation Basics

Here’s a minimal setup for a Python FastAPI service calling an LLM:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from openai import OpenAI

# Setup
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-ai-app")
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Wrap LLM calls
def call_llm(prompt: str, model: str = "gpt-4o"):
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.prompt_length", len(prompt))
        span.set_attribute("llm.prompt_preview", prompt[:200])  # First 200 chars

        response = openai_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )

        usage = response.usage
        span.set_attribute("llm.prompt_tokens", usage.prompt_tokens)
        span.set_attribute("llm.completion_tokens", usage.completion_tokens)
        span.set_attribute("llm.total_tokens", usage.total_tokens)
        span.set_attribute("llm.finish_reason", response.choices[0].finish_reason)

        # Cost estimation: these rates are $5 per 1M prompt tokens and
        # $15 per 1M completion tokens; update as model pricing changes
        cost = (usage.prompt_tokens * 0.000005) + (usage.completion_tokens * 0.000015)
        span.set_attribute("llm.estimated_cost_usd", cost)

        return response

This gives you spans you can actually reason about. When something goes wrong, you can look at the trace and see: the prompt was 3,400 tokens (large), the finish reason was length (the model hit the token limit before finishing), and the cost was $0.12 (high for a single call). You now know what to investigate.

What to Instrument in AI Applications

Beyond raw LLM calls, there are several additional layers worth instrumenting:

RAG Pipeline

If you’re using retrieval-augmented generation, instrument each stage separately:

with tracer.start_as_current_span("rag.retrieve") as span:
    span.set_attribute("rag.query", user_query)
    span.set_attribute("rag.collection", "product_docs")
    
    results = vector_store.similarity_search(user_query, k=5)
    
    span.set_attribute("rag.results_count", len(results))
    span.set_attribute("rag.top_score", results[0].score if results else 0)
    span.set_attribute("rag.total_tokens", sum(r.token_count for r in results))

This surfaces common RAG failure modes: retrieval returning zero results, low similarity scores, context windows being flooded by long documents.

Tool/Function Calls

Agentic applications call tools — database queries, API calls, web searches. Trace each one:

with tracer.start_as_current_span("agent.tool_call") as span:
    span.set_attribute("tool.name", tool_name)
    span.set_attribute("tool.input", str(tool_input)[:500])
    span.set_attribute("tool.call_number", call_count)  # Track loops

Track call_number to catch infinite loops — a common agentic failure where the model keeps calling the same tool without making progress.
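Here's a minimal sketch of how that counter might gate an agent loop. next_action and run_tool are hypothetical stand-ins for your agent framework's planning and execution steps:

MAX_TOOL_CALLS = 20  # matches the per-session cap suggested below

def agent_loop(task: str):
    call_count = 0
    while (action := next_action(task)) is not None:  # hypothetical planner step
        call_count += 1
        if call_count > MAX_TOOL_CALLS:
            raise RuntimeError(f"exceeded {MAX_TOOL_CALLS} tool calls; likely a loop")
        with tracer.start_as_current_span("agent.tool_call") as span:
            span.set_attribute("tool.name", action.name)
            span.set_attribute("tool.input", str(action.input)[:500])
            span.set_attribute("tool.call_number", call_count)
            run_tool(action)  # hypothetical executor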

Response Quality Metrics

If you have a way to evaluate response quality (a scoring function, user feedback, a guard model), record those scores as span attributes or metrics:

span.set_attribute("response.quality_score", quality_score)
span.set_attribute("response.contains_hallucination", has_hallucination)
span.set_attribute("response.user_rating", user_rating)  # If you collect feedback

Quality signals tied to traces let you correlate response problems with specific prompts, models, and contexts.

Key Metrics to Track

Beyond traces, set up these metrics as time-series in your metrics backend:

Metric                        Type               Why It Matters
llm.tokens.prompt             Histogram          Detect prompt bloat, context window abuse
llm.tokens.completion         Histogram          Catch runaway generation
llm.latency_ms                Histogram          SLO tracking, latency regression detection
llm.cost_usd                  Counter            Daily cost budget alerts
llm.error_rate                Rate               Model API failures, rate limiting
llm.finish_reason             Counter by reason  length = token limit hit; content_filter = moderation
rag.retrieval_score           Histogram          Retrieval quality degradation
agent.tool_calls_per_session  Histogram          Loop detection

Alert on llm.cost_usd exceeding your daily budget, llm.error_rate above threshold, and agent.tool_calls_per_session above a maximum (a reasonable cap for most applications is 20 tool calls per session before something is clearly wrong).
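Wiring a few of these up mirrors the trace setup from earlier. Here's a sketch using the OTel metrics SDK; the metric names follow the table above, and the endpoint assumes the same local collector:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Export metrics to the same local collector as the traces
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317")
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("my-ai-app")

prompt_tokens = meter.create_histogram("llm.tokens.prompt", unit="{token}")
completion_tokens = meter.create_histogram("llm.tokens.completion", unit="{token}")
cost_usd = meter.create_counter("llm.cost_usd", unit="{usd}")

# Inside call_llm, after the response comes back:
prompt_tokens.record(usage.prompt_tokens, {"llm.model": model})
completion_tokens.record(usage.completion_tokens, {"llm.model": model})
cost_usd.add(cost, {"llm.model": model})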

The Semantic Conventions Problem

One challenge with LLM observability right now: there’s no settled standard for what attributes to put on LLM spans. OpenTelemetry is actively developing semantic conventions for generative AI (gen_ai.* prefix), but the spec is still evolving as of mid-2026.

The practical consequence: if you use a library like LangChain, LlamaIndex, or the OpenAI Python SDK, they may produce spans with different attribute names than what you define manually. Your dashboards may need to normalize across attribute naming schemes.

For new projects, use the emerging gen_ai.* conventions where they exist, and prefix your custom attributes consistently (llm.* or your own namespace) to distinguish them.
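As a rough illustration, here's what the same completion span might look like under the draft gen_ai.* names. Treat the exact attribute names as provisional and check the current semantic conventions spec before standardizing on them:

with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # ... make the call ...
    span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
    # Custom attributes keep their own namespace alongside the standard ones
    span.set_attribute("llm.estimated_cost_usd", cost)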

Libraries That Help

You don’t have to instrument everything from scratch. Several libraries have done the work for common stacks:

  • OpenLLMetry by Traceloop: Drop-in OTel instrumentation for OpenAI, Anthropic, LangChain, and others. One import adds spans to every LLM call.
  • Langfuse: Purpose-built LLM observability with traces, evaluations, and a dataset management layer. Has an OTel-compatible export path.
  • Arize Phoenix: OSS tracing UI designed for LLM apps, accepts OTel data.

# OpenLLMetry: one-line instrumentation
from traceloop.sdk import Traceloop
Traceloop.init(app_name="my-ai-app")

# That's it. LLM calls are now traced automatically.

These save time, but read the spans they generate before trusting them for production alerting. The auto-instrumented spans may not capture everything you need.

Where to Start

If you’re running an AI application in production with no LLM-specific observability today, the highest-value first step is simple: log token counts and finish reasons for every LLM call, alongside your normal request logs.

Token counts tell you about prompt size and cost trends. Finish reasons tell you whether your model is completing normally, hitting limits, or being stopped by content filters. Together, they surface 80% of the production issues teams encounter.
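That can be as plain as a structured log line next to the call. A minimal sketch with the stdlib logger (the field names here are our own, not a standard):

import logging

logger = logging.getLogger("llm")

response = openai_client.chat.completions.create(
    model=model, messages=[{"role": "user", "content": prompt}]
)
usage = response.usage
logger.info(
    "llm_call model=%s prompt_tokens=%d completion_tokens=%d finish_reason=%s",
    model,
    usage.prompt_tokens,
    usage.completion_tokens,
    response.choices[0].finish_reason,
)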

From there, add traces to separate your RAG retrieval from your generation step. Understanding where latency lives in your pipeline is the second most useful thing you can know.
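Structurally that's just nested spans around each stage; retrieve and build_prompt below are hypothetical placeholders for your own pipeline functions:

with tracer.start_as_current_span("rag.pipeline"):
    with tracer.start_as_current_span("rag.retrieve"):
        docs = retrieve(user_query)  # hypothetical retrieval step
    with tracer.start_as_current_span("llm.generate"):
        answer = call_llm(build_prompt(user_query, docs))  # hypothetical prompt builder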

Full distributed traces with quality metrics, cost dashboards, and evaluation pipelines come later, once you know which signals actually matter for your specific application.
