AI Integration · Quality Assurance

LLM Evals in Practice: Testing AI Features Before They Go Wrong

Unit tests tell you if your code does what you wrote. They don't tell you if your AI feature does what users need. Here's how to build an evaluation pipeline that catches the failures that matter before users do.

Anurag Verma

8 min read


The first time a user asks your AI feature something it handles badly, you want to have already seen that failure in a test run. Not after they post about it. Not after you get a support ticket. In a local or CI run, where you can fix it before it ships.

That’s what LLM evaluation — evals — is for. It’s not a new concept, but the tooling has matured enough in the last year that the barrier to building a serious eval pipeline is much lower than it was in 2024.

Here’s what the practice looks like in 2026, with real tooling rather than abstract frameworks.

Why Unit Tests Aren’t Enough

Consider a RAG-based Q&A feature over your product documentation. You write a test:

def test_returns_string():
    response = qa_feature("How do I reset my password?")
    assert isinstance(response, str)
    assert len(response) > 0

This test passes. It always passes. It tells you the function runs. It tells you nothing about whether the answer is correct, relevant, or safe.

LLM outputs are probabilistic. The same prompt returns slightly different answers on each call. The “correct” answer often isn’t binary — “your password reset link will be sent within 5 minutes” is correct; “your password reset link will be sent within 2 minutes” is subtly wrong; “reset your password by calling support” is completely wrong. A unit test can’t distinguish these.

Evals are test cases with a different evaluation strategy: instead of asserting an exact output, you assess properties of the output — relevance, faithfulness to source documents, presence of specific claims, absence of hallucinated content.

The Three Eval Approaches

1. Exact or Substring Match

Use this when the correct answer is well-defined and finite. Routing decisions, classification labels, entity extraction, SQL generation.

test_cases = [
    {"input": "book a flight to Tokyo", "expected_intent": "travel_booking"},
    {"input": "cancel my subscription", "expected_intent": "account_management"},
    {"input": "what's the weather tomorrow", "expected_intent": "out_of_scope"},
]

def eval_intent_classifier(test_cases):
    results = []
    for case in test_cases:
        predicted = classify_intent(case["input"])
        results.append({
            "input": case["input"],
            "expected": case["expected_intent"],
            "predicted": predicted,
            "correct": predicted == case["expected_intent"],
        })
    accuracy = sum(r["correct"] for r in results) / len(results)
    return accuracy, results
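
Wrapped in a test, a threshold on the aggregate keeps the suite honest (the 90% bar here is an arbitrary starting point; tune it to your feature):

def test_intent_accuracy():
    accuracy, results = eval_intent_classifier(test_cases)
    failures = [r for r in results if not r["correct"]]
    assert accuracy >= 0.9, f"Accuracy {accuracy:.0%}, failures: {failures}"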

For structured output extraction, compare the extracted fields rather than raw strings:

def eval_extraction(response_json, expected):
    return all(
        response_json.get(k) == v
        for k, v in expected.items()
    )
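
For example, against a hypothetical invoice extractor (the function and field names are illustrative, not from any particular library):

# extract_invoice_fields is a stand-in for your own structured-output call
expected = {"vendor": "Acme Corp", "total": 1240.50, "currency": "USD"}
response_json = extract_invoice_fields("Invoice from Acme Corp, total: $1,240.50")
assert eval_extraction(response_json, expected)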

2. Model-Graded Evaluation

When the correct answer isn’t a fixed string, use a model to judge the quality of another model’s output. This works for open-ended Q&A, summaries, explanations, and content generation.

The eval model scores or classifies the answer against a rubric:

import json

import anthropic

eval_client = anthropic.Anthropic()

def llm_judge(question: str, answer: str, source_context: str) -> dict:
    prompt = f"""You are evaluating an AI assistant's response.

Question: {question}
Source context: {source_context}
Assistant's answer: {answer}

Rate the answer on two dimensions:
1. Faithfulness (1-5): Does the answer only use information from the source context?
2. Relevance (1-5): Does the answer address what was asked?

Respond as JSON: {{"faithfulness": <int>, "relevance": <int>, "reasoning": "<one sentence>"}}"""

    response = eval_client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(response.content[0].text)

Use a smaller, cheaper model for evaluation — it doesn’t need the same capability as your production model. Haiku or GPT-4o-mini work well as evaluators for most rubrics.

One caution: model-graded evals introduce their own biases. The evaluator model tends to prefer verbose answers over concise ones and can miss subtle hallucinations. Validate your eval rubric by having humans score a sample and checking that the model’s scores correlate.
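
A minimal version of that check, assuming scipy is available and your annotators scored the same sample the judge scored (the 0.7 bar is a rule of thumb, not a standard):

from scipy.stats import spearmanr

# Faithfulness scores for the same sampled outputs, one pair per output
human_scores = [5, 4, 2, 5, 3, 1, 4]   # annotator ratings
judge_scores = [5, 4, 3, 5, 3, 2, 4]   # llm_judge ratings

corr, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman correlation: {corr:.2f} (p={p_value:.3f})")
# Below roughly 0.7, revise the rubric before trusting the judge's scores.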

3. Retrieval Quality (for RAG)

If your feature uses retrieval, evaluate the retrieval step separately from the generation step. A failure in retrieval looks like a generation failure but requires a different fix.

Context precision: Of the chunks retrieved, how many were actually relevant to the question? Low precision means you’re stuffing irrelevant context into your prompt, which increases cost and can confuse the model.

Context recall: Of all the relevant information that exists, how much did you retrieve? Low recall means the model didn’t have the information it needed to answer correctly.
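
If you've labeled which chunks are relevant to each question, both metrics reduce to simple set arithmetic; a minimal sketch, assuming string chunk IDs:

def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    # Of what we retrieved, how much was relevant?
    if not retrieved_ids:
        return 0.0
    return sum(1 for c in retrieved_ids if c in relevant_ids) / len(retrieved_ids)

def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    # Of what was relevant, how much did we retrieve?
    if not relevant_ids:
        return 1.0
    return sum(1 for c in relevant_ids if c in retrieved_ids) / len(relevant_ids)

Without per-chunk relevance labels, a judge model can approximate the same questions: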

def eval_retrieval(question: str, retrieved_chunks: list[str], ground_truth_answer: str) -> dict:
    # Ask the evaluator: given only these chunks, can you answer the question?
    chunks_text = "\n\n---\n\n".join(retrieved_chunks)
    
    response = eval_client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"""Given these retrieved chunks:
{chunks_text}

And this ground truth answer: {ground_truth_answer}

1. Is the ground truth answer derivable from the chunks? (yes/no)
2. Are any chunks completely irrelevant to the question "{question}"? (list indices, or "none")

Respond as JSON: {{"derivable": "<yes|no>", "irrelevant_chunks": <list of indices, or "none">}}"""
        }]
    )
    return json.loads(response.content[0].text)
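
Run it across the eval set and retrieval gaps surface directly (eval_set here is a hypothetical list of annotated cases matching the signature above):

for case in eval_set:
    verdict = eval_retrieval(case["question"], case["chunks"], case["ground_truth"])
    if verdict["derivable"] == "no":
        # The generator never had the facts; fix chunking or retrieval, not the prompt
        print(f"Retrieval gap: {case['question']!r}")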

Tooling: What’s Worth Using

DeepEval

DeepEval is the most complete open-source eval library as of 2026. It ships with pre-built metrics for G-Eval (general), faithfulness, contextual relevance, hallucination detection, answer relevancy, and bias. It integrates with pytest so your evals run as part of your test suite.

from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_qa_faithfulness():
    test_case = LLMTestCase(
        input="What is the refund policy?",
        actual_output=qa_feature("What is the refund policy?"),
        retrieval_context=["Refunds are available within 30 days of purchase for unused items."]
    )
    metric = FaithfulnessMetric(threshold=0.8, model="gpt-4o-mini")
    assert_test(test_case, [metric])

Run with deepeval test run test_qa.py. It outputs a table of scores per test case, flags failures, and caches results so rerunning only re-evaluates changed cases.

Ragas

Ragas focuses specifically on RAG evaluation. Its metrics (faithfulness, answer relevancy, context precision, context recall) are well-documented and have been widely benchmarked against human judgment. If you’re building a RAG feature, Ragas is the first tool to reach for.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

data = {
    "question": ["What is the return policy?"],
    "answer": [qa_feature("What is the return policy?")],
    "contexts": [retrieved_chunks_for_question],
    "ground_truth": ["Returns accepted within 30 days with receipt."]
}

dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(results)

PromptFoo

PromptFoo takes a different approach: it’s a YAML-configured test runner that runs your prompts against multiple inputs, compares outputs across model versions, and can run red-team style adversarial tests. It’s less code-first than DeepEval and better for non-engineers to configure.

# promptfooconfig.yaml
prompts:
  - "Answer the following question about our product: {{question}}"

providers:
  - id: anthropic:claude-haiku-4-5-20251001
  - id: openai:gpt-4o-mini

tests:
  - vars:
      question: "How do I cancel my subscription?"
    assert:
      - type: llm-rubric
        value: "The response should mention account settings or contacting support"
  - vars:
      question: "How do I get a refund?"
    assert:
      - type: contains
        value: "30 days"

Run with npx promptfoo eval. Useful for comparing model versions before upgrading or for regression testing after prompt changes.

Building an Eval Dataset

The tooling is only as good as your test cases. A representative eval dataset is the hard part.

Start with real user queries. If you have production traffic, sample actual questions your feature received. These are worth more than synthetic examples because they reflect what users actually ask, including the edge cases you didn’t imagine during design.

Collect failure cases. When users report that the feature answered incorrectly, add that query (plus the expected correct answer) to the eval set. Over time, this turns your eval into a regression suite against known failures.

Curate, don’t generate. It’s tempting to generate 500 synthetic test cases from a model. The result is usually a set of cases that are easier than real queries and miss the long-tail failures that matter. Fifty human-curated cases beat five hundred generated ones.

Annotate ground truth carefully. For Q&A, write the ideal answer explicitly. For classification, record the correct label with a brief justification. This annotation work is slow, but it pays off in eval quality.
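
One convention that works well (the file name and fields are our choice, not a standard): one JSON object per case, one case per line, versioned in the repo next to the tests.

import json
from pathlib import Path

# evals/cases.jsonl, one case per line, e.g.:
# {"id": "refund-01", "input": "How do I get a refund?",
#  "ideal_answer": "Refunds are available within 30 days...",
#  "source": "support-ticket-4821", "added": "2026-01-12"}
def load_eval_cases(path: str = "evals/cases.jsonl") -> list[dict]:
    lines = Path(path).read_text().splitlines()
    return [json.loads(line) for line in lines if line.strip()]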

Integrating Evals Into CI

Evals that only run locally don’t catch regressions. The goal is to run them automatically on pull requests that touch prompts, retrieval logic, or model configuration.

A minimal CI integration with GitHub Actions:

name: LLM Evals

on:
  pull_request:
    paths:
      - 'src/ai/**'
      - 'prompts/**'
      - 'retrieval/**'

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install deepeval ragas
      - run: deepeval test run tests/evals/
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Keep the eval set small enough that CI completes in under 5 minutes. A 50-100 case set with model-graded scoring typically runs in 2-3 minutes and costs under $0.10 per run using small evaluator models.

What to Measure Over Time

Once your pipeline is running, track these metrics per release:

  • Average faithfulness score (for RAG features)
  • Per-category accuracy (if you have classification)
  • Failure rate (cases below threshold)
  • Latency distribution (p50 and p99)

Declining scores on a specific category between versions usually point to a prompt change or a retrieval configuration change affecting that category. With per-release tracking, you can correlate score changes to specific commits rather than debugging in the dark.
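
A lightweight way to build that history (file name and schema are our choice, not a standard): append one JSON line per CI run, keyed to the commit, and diff across releases.

import json
import os
import time

def record_run(metrics: dict, path: str = "eval_history.jsonl") -> None:
    # metrics might look like {"faithfulness_avg": 0.87, "failure_rate": 0.06, "latency_p99_ms": 2140}
    entry = {
        "commit": os.environ.get("GITHUB_SHA", "local"),  # populated by GitHub Actions
        "timestamp": time.time(),
        **metrics,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")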

The goal isn’t a perfect score — it’s a stable score. An AI feature that scores 87% faithfulness consistently is easier to reason about and debug than one that averages 95% but has high variance. Stability tells you that your feature behaves predictably. Users can calibrate their trust to predictable systems.
