

OpenTelemetry for AI Applications: Observability When Your Stack Thinks for Itself

Traditional monitoring tells you a request took 800ms. It doesn't tell you the LLM spent 600ms on a bad prompt, returned a hallucinated answer, and burned $0.04 in tokens. Here's how to actually instrument AI applications with OpenTelemetry.

Anurag Verma


7 min read


You deploy a web app. Something gets slow. You open your traces, find the slow database query, add an index, and ship the fix. This workflow is so well-established that most teams do it without thinking.

Now you deploy an AI-powered app. Something goes wrong. A user gets a bizarre response. Your cost per request doubled overnight. A feature that worked last week started producing outputs that don’t make sense. You open your traces.

Your traces show a single HTTP call to api.openai.com that returned 200. That’s it. The black box stayed black.

This is the observability gap most teams discover the hard way after they move AI from demo to production.

What Changes With AI in the Stack

Traditional web services have a comforting property: given the same inputs, they produce the same outputs. Debugging is mostly about finding which input triggered which code path. Logs, metrics, and traces work well for this.

LLMs are different in three ways that break standard observability patterns:

Non-determinism. The same prompt can produce different outputs across calls. A failing user session might not be reproducible, even if you replay the exact request.

Prompt dependency. The application’s behavior depends heavily on the prompt, not just the code. Two apps with identical code but different prompts behave completely differently. Most observability tools treat the prompt as an opaque string in the request body.

Cost as a first-class signal. Token usage is both a correctness signal and a cost signal. A request that uses 10x more tokens than expected is often a sign of something going wrong — a runaway loop, an oversized context, or a prompt that’s generating excessive output.

OpenTelemetry’s Role

OpenTelemetry (OTel) is the CNCF standard for collecting telemetry data — traces, metrics, and logs — from your applications. It’s vendor-neutral, so the data you collect can go to Grafana, Jaeger, Honeycomb, Datadog, or any other backend.

What makes it useful for AI applications is the ability to create custom spans and attributes. A standard HTTP trace tells you the call succeeded in 800ms. A custom OTel span tells you: call succeeded in 800ms, prompt was 1,240 tokens, completion was 340 tokens, model was gpt-4o, cost was $0.021, and the response included a tool call to search_database.

The difference between those two data points is the difference between knowing a request happened and understanding what your application actually did.

Instrumentation Basics

Here’s a minimal setup for a Python FastAPI service calling an LLM:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from openai import OpenAI

# Setup
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-ai-app")
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Wrap LLM calls
def call_llm(prompt: str, model: str = "gpt-4o"):
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.prompt_length", len(prompt))
        span.set_attribute("llm.prompt_preview", prompt[:200])  # First 200 chars

        response = openai_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )

        usage = response.usage
        span.set_attribute("llm.prompt_tokens", usage.prompt_tokens)
        span.set_attribute("llm.completion_tokens", usage.completion_tokens)
        span.set_attribute("llm.total_tokens", usage.total_tokens)
        span.set_attribute("llm.finish_reason", response.choices[0].finish_reason)

        # Cost estimation: these rates are $5 per 1M prompt tokens and
        # $15 per 1M completion tokens; update as model pricing changes
        cost = (usage.prompt_tokens * 0.000005) + (usage.completion_tokens * 0.000015)
        span.set_attribute("llm.estimated_cost_usd", cost)

        return response

This gives you spans you can actually reason about. When something goes wrong, you can look at the trace and see: the prompt was 3,400 tokens (large), the finish reason was length (the model hit the token limit before finishing), and the cost was $0.12 (high for a single call). You now know what to investigate.

What to Instrument in AI Applications

Beyond raw LLM calls, there are several additional layers worth instrumenting:

RAG Pipeline

If you’re using retrieval-augmented generation, instrument each stage separately:

with tracer.start_as_current_span("rag.retrieve") as span:
    span.set_attribute("rag.query", user_query)
    span.set_attribute("rag.collection", "product_docs")
    
    results = vector_store.similarity_search(user_query, k=5)
    
    span.set_attribute("rag.results_count", len(results))
    span.set_attribute("rag.top_score", results[0].score if results else 0)
    span.set_attribute("rag.total_tokens", sum(r.token_count for r in results))

This surfaces common RAG failure modes: retrieval returning zero results, low similarity scores, context windows being flooded by long documents.

Tool/Function Calls

Agentic applications call tools — database queries, API calls, web searches. Trace each one:

with tracer.start_as_current_span("agent.tool_call") as span:
    span.set_attribute("tool.name", tool_name)
    span.set_attribute("tool.input", str(tool_input)[:500])
    span.set_attribute("tool.call_number", call_count)  # Track loops

Track call_number to catch infinite loops — a common agentic failure where the model keeps calling the same tool without making progress.
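Here's a minimal sketch of how that counter might gate an agent loop. next_action and run_tool are hypothetical stand-ins for your agent framework's planning and execution steps:

MAX_TOOL_CALLS = 20  # matches the per-session cap suggested below

def agent_loop(task: str):
    call_count = 0
    while (action := next_action(task)) is not None:  # hypothetical planner step
        call_count += 1
        if call_count > MAX_TOOL_CALLS:
            raise RuntimeError(f"exceeded {MAX_TOOL_CALLS} tool calls; likely a loop")
        with tracer.start_as_current_span("agent.tool_call") as span:
            span.set_attribute("tool.name", action.name)
            span.set_attribute("tool.input", str(action.input)[:500])
            span.set_attribute("tool.call_number", call_count)
            run_tool(action)  # hypothetical executor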

Response Quality Metrics

If you have a way to evaluate response quality (a scoring function, user feedback, a guard model), record those scores as span attributes or metrics:

span.set_attribute("response.quality_score", quality_score)
span.set_attribute("response.contains_hallucination", has_hallucination)
span.set_attribute("response.user_rating", user_rating)  # If you collect feedback

Quality signals tied to traces let you correlate response problems with specific prompts, models, and contexts.

Key Metrics to Track

Beyond traces, set up these metrics as time-series in your metrics backend:

Metric                        Type               Why It Matters
llm.tokens.prompt             Histogram          Detect prompt bloat, context window abuse
llm.tokens.completion         Histogram          Catch runaway generation
llm.latency_ms                Histogram          SLO tracking, latency regression detection
llm.cost_usd                  Counter            Daily cost budget alerts
llm.error_rate                Rate               Model API failures, rate limiting
llm.finish_reason             Counter by reason  length = token limit hit; content_filter = moderation
rag.retrieval_score           Histogram          Retrieval quality degradation
agent.tool_calls_per_session  Histogram          Loop detection

Alert on llm.cost_usd exceeding your daily budget, llm.error_rate above threshold, and agent.tool_calls_per_session above a maximum (a reasonable cap for most applications is 20 tool calls per session before something is clearly wrong).
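Wiring a few of these up mirrors the trace setup from earlier. Here's a sketch using the OTel metrics SDK; the metric names follow the table above, and the endpoint assumes the same local collector:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Export metrics to the same local collector as the traces
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317")
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("my-ai-app")

prompt_tokens = meter.create_histogram("llm.tokens.prompt", unit="{token}")
completion_tokens = meter.create_histogram("llm.tokens.completion", unit="{token}")
cost_usd = meter.create_counter("llm.cost_usd", unit="{usd}")

# Inside call_llm, after the response comes back:
prompt_tokens.record(usage.prompt_tokens, {"llm.model": model})
completion_tokens.record(usage.completion_tokens, {"llm.model": model})
cost_usd.add(cost, {"llm.model": model})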

The Semantic Conventions Problem

One challenge with LLM observability right now: there’s no settled standard for what attributes to put on LLM spans. OpenTelemetry is actively developing semantic conventions for generative AI (gen_ai.* prefix), but the spec is still evolving as of mid-2026.

The practical consequence: if you use a library like LangChain, LlamaIndex, or the OpenAI Python SDK, they may produce spans with different attribute names than what you define manually. Your dashboards may need to normalize across attribute naming schemes.

For new projects, use the emerging gen_ai.* conventions where they exist, and prefix your custom attributes consistently (llm.* or your own namespace) to distinguish them.
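As a rough illustration, here's what the same completion span might look like under the draft gen_ai.* names. Treat the exact attribute names as provisional and check the current semantic conventions spec before standardizing on them:

with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # ... make the call ...
    span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
    # Custom attributes keep their own namespace alongside the standard ones
    span.set_attribute("llm.estimated_cost_usd", cost)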

Libraries That Help

You don’t have to instrument everything from scratch. Several libraries have done the work for common stacks:

  • OpenLLMetry by Traceloop: Drop-in OTel instrumentation for OpenAI, Anthropic, LangChain, and others. One import adds spans to every LLM call.
  • Langfuse: Purpose-built LLM observability with traces, evaluations, and a dataset management layer. Has an OTel-compatible export path.
  • Arize Phoenix: OSS tracing UI designed for LLM apps, accepts OTel data.

# OpenLLMetry: one-line instrumentation
from traceloop.sdk import Traceloop
Traceloop.init(app_name="my-ai-app")

# That's it. LLM calls are now traced automatically.

These save time, but read the spans they generate before trusting them for production alerting. The auto-instrumented spans may not capture everything you need.

Where to Start

If you’re running an AI application in production with no LLM-specific observability today, the highest-value first step is simple: log token counts and finish reasons for every LLM call, alongside your normal request logs.

Token counts tell you about prompt size and cost trends. Finish reasons tell you whether your model is completing normally, hitting limits, or being stopped by content filters. Together, they surface 80% of the production issues teams encounter.
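That can be as plain as a structured log line next to the call. A minimal sketch with the stdlib logger (the field names here are our own, not a standard):

import logging

logger = logging.getLogger("llm")

response = openai_client.chat.completions.create(
    model=model, messages=[{"role": "user", "content": prompt}]
)
usage = response.usage
logger.info(
    "llm_call model=%s prompt_tokens=%d completion_tokens=%d finish_reason=%s",
    model,
    usage.prompt_tokens,
    usage.completion_tokens,
    response.choices[0].finish_reason,
)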

From there, add traces to separate your RAG retrieval from your generation step. Understanding where latency lives in your pipeline is the second most useful thing you can know.
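Structurally that's just nested spans around each stage; retrieve and build_prompt below are hypothetical placeholders for your own pipeline functions:

with tracer.start_as_current_span("rag.pipeline"):
    with tracer.start_as_current_span("rag.retrieve"):
        docs = retrieve(user_query)  # hypothetical retrieval step
    with tracer.start_as_current_span("llm.generate"):
        answer = call_llm(build_prompt(user_query, docs))  # hypothetical prompt builder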

Full distributed traces with quality metrics, cost dashboards, and evaluation pipelines come later, once you know which signals actually matter for your specific application.
