LLM Observability in 2026: What to Track and Which Tools to Use

Building an AI feature is only half the work. Once it's in production, you need to know when it's drifting, what it's costing, and where it's failing. Here's how to instrument LLM applications properly.

Anurag Verma

You shipped the AI feature. Users are using it. Then three weeks later a user reports it gave wrong information. You look in your logs. You see the API call succeeded and got a 200. You have no idea what was actually in the prompt, what the model returned, how long it took, or what it cost.

That’s the gap LLM observability tools fill. They sit between your application and the model API, recording every input, output, latency, and cost. When something breaks or degrades, you can replay the exact call, see the full context window, and trace the problem.

This is different from general application observability (APM, distributed tracing). Those tools can tell you that a function was slow. LLM observability tells you that the function was slow because the prompt included 40,000 tokens when it should have had 8,000.

What You Need to Track

Before picking a tool, understand what signals matter for an LLM application.

Traces. A trace is a single end-to-end request through your application. For an AI feature, it typically contains multiple spans: the retrieval step, the context assembly step, the actual LLM call, and post-processing. Traces let you see not just whether the overall request was slow, but which step was responsible.
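As a rough illustration of that structure, here is a minimal sketch of a trace as a list of timed spans, with a helper that picks out the step that took longest. The `Span` and `Trace` shapes and the `slowestSpan` function are hypothetical names for this example, not the schema of any particular tool.

```typescript
// Hypothetical shapes for illustration; real tools define their own schemas.
interface Span {
  name: string;    // e.g. 'retrieval', 'llm-call'
  startMs: number; // start time in milliseconds
  endMs: number;   // end time in milliseconds
}

interface Trace {
  id: string;
  spans: Span[];
}

// Return the span with the longest duration, or undefined for an empty trace.
function slowestSpan(trace: Trace): Span | undefined {
  let slowest: Span | undefined;
  for (const span of trace.spans) {
    if (!slowest || span.endMs - span.startMs > slowest.endMs - slowest.startMs) {
      slowest = span;
    }
  }
  return slowest;
}

const exampleTrace: Trace = {
  id: 'trace-1',
  spans: [
    { name: 'retrieval', startMs: 0, endMs: 120 },
    { name: 'context-assembly', startMs: 120, endMs: 135 },
    { name: 'llm-call', startMs: 135, endMs: 2135 },
    { name: 'post-processing', startMs: 2135, endMs: 2150 },
  ],
};

console.log(slowestSpan(exampleTrace)?.name); // prints "llm-call"
```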

Token counts and costs. The LLM API bills per token. Without tracking this, you’re flying blind on the actual cost per user request. Token counts also signal problems. A context that suddenly balloons in size usually means your retrieval is pulling too many chunks.
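To make the arithmetic concrete, the sketch below estimates cost per request from token usage. The model name and the per-million-token prices are placeholders invented for this example; substitute your provider's current rates.

```typescript
// Hypothetical prices in USD per million tokens; not real rates for any provider.
interface ModelPricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

const PRICING: Record<string, ModelPricing> = {
  'example-model': { inputPerMillion: 2.0, outputPerMillion: 8.0 },
};

interface Usage {
  promptTokens: number;
  completionTokens: number;
}

// Estimate the USD cost of one request from its token usage.
function estimateCostUsd(model: string, usage: Usage): number {
  const price = PRICING[model];
  if (!price) {
    throw new Error(`No pricing configured for model: ${model}`);
  }
  return (
    (usage.promptTokens / 1_000_000) * price.inputPerMillion +
    (usage.completionTokens / 1_000_000) * price.outputPerMillion
  );
}

// Same answer length, but one request carries a bloated context.
const normal = estimateCostUsd('example-model', { promptTokens: 8_000, completionTokens: 500 });
const bloated = estimateCostUsd('example-model', { promptTokens: 40_000, completionTokens: 500 });

console.log(normal.toFixed(3));  // prints "0.020"
console.log(bloated.toFixed(3)); // prints "0.084"
```

With these placeholder prices, the bloated request costs roughly four times as much for the same answer, which is the kind of jump a per-request cost metric makes visible.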

Latency by component. Overall latency is less useful than knowing that your vector search is fast but the reranking step is slow on 15% of queries.
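One way to get that per-component view is to group span durations by component and compute a percentile for each group. The sketch below uses the nearest-rank method; `SpanTiming` and `p95ByComponent` are hypothetical names, and in practice your observability tool will usually compute this for you.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error('Cannot take a percentile of an empty array');
  }
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const clamped = Math.min(sorted.length, Math.max(1, rank));
  return sorted[clamped - 1];
}

// Hypothetical record of one span's duration.
interface SpanTiming {
  component: string;
  durationMs: number;
}

// Group durations by component, then take the 95th percentile of each group.
function p95ByComponent(timings: SpanTiming[]): Map<string, number> {
  const byComponent = new Map<string, number[]>();
  for (const t of timings) {
    const list = byComponent.get(t.component) ?? [];
    list.push(t.durationMs);
    byComponent.set(t.component, list);
  }
  const result = new Map<string, number>();
  byComponent.forEach((durations, component) => {
    result.set(component, percentile(durations, 95));
  });
  return result;
}

const timings: SpanTiming[] = [
  { component: 'vector-search', durationMs: 10 },
  { component: 'vector-search', durationMs: 12 },
  { component: 'vector-search', durationMs: 11 },
  { component: 'vector-search', durationMs: 13 },
  { component: 'rerank', durationMs: 50 },
  { component: 'rerank', durationMs: 55 },
  { component: 'rerank', durationMs: 400 },
  { component: 'rerank', durationMs: 52 },
];

const p95 = p95ByComponent(timings);
console.log(p95.get('vector-search'), p95.get('rerank')); // prints "13 400"
```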

Output quality. Hard to measure automatically, but critical. Did the model follow instructions? Did it stay within the expected topic domain? Did it refuse when it should have answered? You need a way to mark outputs as good or bad and query that data.
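Cheap programmatic checks will not tell you whether an answer is correct, but they can flag the crudest failures for human review. The sketch below is one assumed approach: the `flagOutput` function, the refusal phrase list, and the length ceiling are all hypothetical and would need tuning against your own model's behavior.

```typescript
interface QualityFlags {
  empty: boolean;           // the model returned nothing usable
  possibleRefusal: boolean; // output contains a known refusal phrase
  tooLong: boolean;         // output exceeds the expected length ceiling
}

// Hypothetical phrase list; collect real refusals from your own traces.
const REFUSAL_MARKERS = ["i can't help with", 'i cannot help with', "i'm unable to"];

// Flag crude quality problems so the trace can be marked for review.
function flagOutput(output: string, maxChars = 4000): QualityFlags {
  const normalized = output.trim().toLowerCase();
  return {
    empty: normalized.length === 0,
    possibleRefusal: REFUSAL_MARKERS.some((marker) => normalized.indexOf(marker) !== -1),
    tooLong: output.length > maxChars,
  };
}

const flags = flagOutput("I'm unable to answer questions about billing.");
console.log(flags.possibleRefusal); // prints "true"
```

Flags like these are a reasonable input for the good/bad marks described above, but they are no substitute for user feedback or a proper evaluation set.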

Prompt versions. When you change a prompt, you need to know whether the change improved or degraded quality. Without versioned prompts tied to production traces, you’re comparing anecdotes.

Error rates. API errors (rate limits, timeouts, context length exceeded) are different from semantic errors (model answered wrong). Track both separately.
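A small classifier keeps API failures in separate buckets so each can have its own counter and alert. This is a sketch under assumptions: `ApiErrorInfo` is a hypothetical normalized shape you would map your client's error object into, HTTP 429 is the standard rate-limit status, and `context_length_exceeded` is the error code the OpenAI API uses (other providers differ). Semantic errors are not thrown at all, so they are tracked separately as quality scores.

```typescript
type ErrorCategory = 'rate_limit' | 'timeout' | 'context_length' | 'other_api_error';

// Hypothetical normalized error shape; populate it from your client's error object.
interface ApiErrorInfo {
  status?: number;    // HTTP status, if a response was received
  code?: string;      // provider error code, if any
  timedOut?: boolean; // true when the request hit a client-side timeout
}

// Map one API failure to a category.
function classifyApiError(err: ApiErrorInfo): ErrorCategory {
  if (err.timedOut) return 'timeout';
  if (err.code === 'context_length_exceeded') return 'context_length';
  if (err.status === 429) return 'rate_limit';
  return 'other_api_error';
}

// Count failures per category over a window of errors.
function tallyErrors(errors: ApiErrorInfo[]): Record<ErrorCategory, number> {
  const counts: Record<ErrorCategory, number> = {
    rate_limit: 0,
    timeout: 0,
    context_length: 0,
    other_api_error: 0,
  };
  for (const err of errors) {
    counts[classifyApiError(err)] += 1;
  }
  return counts;
}

const counts = tallyErrors([
  { status: 429 },
  { status: 400, code: 'context_length_exceeded' },
  { timedOut: true },
  { status: 500 },
  { status: 429 },
]);
console.log(counts.rate_limit, counts.context_length, counts.timeout, counts.other_api_error);
// prints "2 1 1 1"
```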

Langfuse

Langfuse is open source and can be self-hosted. It’s a common choice for teams that want control over their data or have compliance requirements that prevent sending conversation data to a third party.

Setup for a Node.js application:

npm install langfuse

import Langfuse from 'langfuse';

const langfuse = new Langfuse({
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  baseUrl: process.env.LANGFUSE_BASE_URL, // your self-hosted instance or cloud
});

Instrumenting a basic LLM call:

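import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// SYSTEM_PROMPT, vectorSearch, and buildPrompt are application code defined elsewhere.
async function answerQuestion(userId: string, question: string) {
  // One trace per user request
  const trace = langfuse.trace({
    name: 'question-answer',
    userId,
    input: { question },
  });

  // Retrieval span
  const retrievalSpan = trace.span({
    name: 'retrieval',
    input: { query: question },
  });

  const chunks = await vectorSearch(question);

  retrievalSpan.end({
    output: { chunkCount: chunks.length },
  });

  // Build the messages once so the logged input is exactly what the model receives
  const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
    { role: 'system', content: SYSTEM_PROMPT },
    { role: 'user', content: buildPrompt(question, chunks) },
  ];

  // Generation span for the LLM call
  const generation = trace.generation({
    name: 'answer-generation',
    model: 'gpt-4o',
    input: messages,
  });

  try {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages,
    });

    const answer = response.choices[0].message.content;

    generation.end({
      output: answer,
      usage: {
        promptTokens: response.usage?.prompt_tokens,
        completionTokens: response.usage?.completion_tokens,
      },
    });

    trace.update({ output: { answer } });
    return answer;
  } catch (err) {
    // Close the generation on API errors so failed calls still show up in the trace
    generation.end({ level: 'ERROR', statusMessage: String(err) });
    throw err;
  } finally {
    // Send buffered events before returning (matters most in short-lived or serverless processes)
    await langfuse.flushAsync();
  }
}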

The Langfuse dashboard shows each trace with its spans, token counts, and cost breakdown. You can filter by user ID, date range, or latency percentile. You can also add scores to traces manually or programmatically, which lets you tie user feedback to specific conversations.

Prompt management. Langfuse includes a prompt registry where you can version your prompts:

// Fetch a prompt version from the registry
const prompt = await langfuse.getPrompt('answer-system-prompt', 2);
const systemPrompt = prompt.compile({ context: 'support' });

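The call above pins version 2, so moving to a newer version still takes a code change. If you omit the version number, getPrompt returns whichever version currently carries the production label, so promoting a new version in the registry changes it for every deployment without a code deploy. Either way, you can compare output quality between versions in the dashboard.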

Helicone

Helicone uses a different architecture: it works as a proxy rather than an SDK. You change your base URL and every call flows through Helicone’s infrastructure, which records the request and response.

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: 'https://oai.helicone.ai/v1',
  defaultHeaders: {
    'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}`,
    'Helicone-Property-Feature': 'customer-support',
  },
});

// Calls made through this client are logged automatically.
// Per-user headers go on the individual request, not on the shared client.
const response = await openai.chat.completions.create(
  {
    model: 'gpt-4o',
    messages: [/* ... */],
  },
  { headers: { 'Helicone-User-Id': userId } }
);

The proxy approach means you get logging with zero changes to your application logic. The downside is that your LLM calls route through Helicone’s servers, which adds latency (typically a few milliseconds) and raises the same data privacy considerations as any third-party service.

Helicone includes rate limiting, cost tracking, and caching features that can reduce your API spend. The cache is particularly useful for development: repeated calls with identical prompts return cached responses without hitting the API.

// Enable caching for this request
const response = await openai.chat.completions.create(
  {
    model: 'gpt-4o',
    messages: [{ role: 'user', content: question }],
  },
  {
    headers: {
      'Helicone-Cache-Enabled': 'true',
      'Helicone-Cache-Bucket-Max-Size': '100',
    },
  }
);

Setting Up User Feedback Loops

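The most useful signal in any LLM observability setup is user feedback: did the user find the answer helpful? Record each rating against the trace ID (or request ID) of the response it refers to, so the feedback can be joined to the exact prompt and output.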

With Langfuse:

// After the user submits feedback (e.g., a thumbs up/down)
await langfuse.score({
  traceId: trace.id,
  name: 'user-rating',
  value: 1, // 1 for positive, 0 for negative
  comment: userComment,
});

With Helicone:

// requestId is the Helicone request ID of the call being rated
await fetch(`https://api.helicone.ai/v1/request/${requestId}/feedback`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.HELICONE_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ rating: true }), // true for positive, false for negative
});

Once you have a corpus of rated traces, you can:

  • Compare prompt versions by average user rating
  • Identify the prompts that produce the most negative feedback
  • Find correlations between token count and rating
  • Build an evaluation dataset from high-confidence positive and negative examples

That last point matters: most teams eventually want to run automated evaluation. Your production traces with user feedback become the ground truth.
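The first comparison in that list, average user rating per prompt version, can be sketched in a few lines once rated traces are exported. The `RatedTrace` shape and `summarizeByPromptVersion` function are hypothetical; the field names depend on how you export data from your tool.

```typescript
// Hypothetical export shape: one row per trace that received user feedback.
interface RatedTrace {
  traceId: string;
  promptVersion: number;
  rating: 0 | 1; // 1 for positive, 0 for negative
}

interface VersionSummary {
  count: number;
  averageRating: number;
}

// Aggregate rating count and average per prompt version.
function summarizeByPromptVersion(traces: RatedTrace[]): Map<number, VersionSummary> {
  const totals = new Map<number, { count: number; sum: number }>();
  for (const t of traces) {
    const entry = totals.get(t.promptVersion) ?? { count: 0, sum: 0 };
    entry.count += 1;
    entry.sum += t.rating;
    totals.set(t.promptVersion, entry);
  }
  const summary = new Map<number, VersionSummary>();
  totals.forEach((entry, version) => {
    summary.set(version, { count: entry.count, averageRating: entry.sum / entry.count });
  });
  return summary;
}

const rated: RatedTrace[] = [
  { traceId: 'a', promptVersion: 1, rating: 1 },
  { traceId: 'b', promptVersion: 1, rating: 0 },
  { traceId: 'c', promptVersion: 1, rating: 0 },
  { traceId: 'd', promptVersion: 1, rating: 1 },
  { traceId: 'e', promptVersion: 2, rating: 1 },
  { traceId: 'f', promptVersion: 2, rating: 1 },
  { traceId: 'g', promptVersion: 2, rating: 1 },
  { traceId: 'h', promptVersion: 2, rating: 0 },
];

const summary = summarizeByPromptVersion(rated);
console.log(summary.get(1)?.averageRating, summary.get(2)?.averageRating); // prints "0.5 0.75"
```

Keeping the count next to the average matters: a version with a handful of ratings can look better or worse than it is, so treat small samples as a hint rather than a verdict.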

What to Alert On

Set up these alerts before you need them:

  • p99 latency above threshold: LLM calls that take longer than a reasonable ceiling usually indicate a context length problem or API degradation
  • Cost per request spiking: A sudden cost increase per request often means a retrieval bug is pulling too many chunks
  • Error rate above baseline: Rate limit errors need infrastructure attention; context length errors often indicate a prompt logic bug
  • Token count per request increasing week-over-week: Usually context drift, where the accumulated context keeps growing without bound

Alerting support varies by tool: some export metrics to a system like Prometheus, others offer webhooks or only an API you can poll. Whichever path you have, wire at least cost and error rate into your existing alerting infrastructure.
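If your tool only gives you raw metrics, the four checks above reduce to comparing a current window against a baseline. The sketch below shows one assumed way to express that; `WindowMetrics`, `AlertThresholds`, `evaluateAlerts`, and every threshold value are hypothetical and should be replaced with numbers that fit your traffic.

```typescript
// Hypothetical aggregate metrics for one time window (e.g. the last hour or week).
interface WindowMetrics {
  p99LatencyMs: number;
  avgCostUsd: number;          // average cost per request
  errorRate: number;           // fraction of requests that failed, 0..1
  avgTokensPerRequest: number;
}

interface AlertThresholds {
  maxP99LatencyMs: number;      // absolute latency ceiling
  maxCostIncreaseRatio: number; // 1.5 means alert when cost is 50% above baseline
  maxErrorRateIncrease: number; // allowed absolute increase over the baseline error rate
  maxTokenGrowthRatio: number;  // 1.2 means alert when tokens grow 20% over baseline
}

// Return the names of the alerts that should fire for the current window.
function evaluateAlerts(
  current: WindowMetrics,
  baseline: WindowMetrics,
  thresholds: AlertThresholds
): string[] {
  const alerts: string[] = [];
  if (current.p99LatencyMs > thresholds.maxP99LatencyMs) {
    alerts.push('p99_latency');
  }
  if (current.avgCostUsd > baseline.avgCostUsd * thresholds.maxCostIncreaseRatio) {
    alerts.push('cost_per_request');
  }
  if (current.errorRate > baseline.errorRate + thresholds.maxErrorRateIncrease) {
    alerts.push('error_rate');
  }
  if (current.avgTokensPerRequest > baseline.avgTokensPerRequest * thresholds.maxTokenGrowthRatio) {
    alerts.push('token_growth');
  }
  return alerts;
}

const baseline: WindowMetrics = { p99LatencyMs: 4000, avgCostUsd: 0.02, errorRate: 0.01, avgTokensPerRequest: 8000 };
const current: WindowMetrics = { p99LatencyMs: 4500, avgCostUsd: 0.05, errorRate: 0.012, avgTokensPerRequest: 12000 };
const thresholds: AlertThresholds = {
  maxP99LatencyMs: 10000,
  maxCostIncreaseRatio: 1.5,
  maxErrorRateIncrease: 0.02,
  maxTokenGrowthRatio: 1.2,
};

const fired = evaluateAlerts(current, baseline, thresholds);
console.log(fired.join(', ')); // prints "cost_per_request, token_growth"
```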

Self-Hosted vs Cloud

The decision is mostly about data sensitivity. If your LLM application processes customer data, legal documents, medical records, or anything regulated, the right choice is usually self-hosted Langfuse: all trace data stays in your infrastructure.

For internal tools or less-sensitive applications, Langfuse Cloud or Helicone removes the operational overhead of running the observability platform. Both offer generous free tiers for small volumes.

Either way, instrument from day one. Retrofitting observability into a production AI feature after a quality problem surfaces is significantly harder than building it in at launch.
