We built a RAG system that answered 40% of queries wrong. Not "close but not quite" wrong. Confidently, convincingly, dead wrong. The kind of wrong where a client's legal team calls you because your AI just told a customer they were entitled to a refund they absolutely were not entitled to.

The fix was not better embeddings. It was not a bigger model. It was not more data. It was rethinking the entire retrieval pipeline from scratch. And after rebuilding it, accuracy jumped from 60% to 94%. That 34-point improvement came not from any single change but from stacking five specific techniques that most RAG tutorials never mention.

If you have built a RAG system and been disappointed by the results, this post will explain why and show you exactly what to do about it. I am going to walk through the evolution from naive RAG to what we now call contextual retrieval, with code examples, benchmarks, and the actual production stack we use at CODERCOPS for client projects.

The Evolution: How We Got Here

Naive RAG (2023): The "Just Embed Everything" Era

The original RAG pattern was simple: chunk your documents, embed the chunks, store them in a vector database, and retrieve the most similar chunks when a user asks a question. Then feed those chunks to an LLM to generate an answer.

User Query → Embed → Vector Search → Top K Chunks → LLM → Answer

This worked shockingly well for demos. And shockingly badly for production.

Why naive RAG fails:

  1. Chunks lose context. When you split a document into 500-token chunks, each chunk loses the context of what document it came from, what section it was in, and what came before and after it.

  2. Semantic similarity is not the same as relevance. "How do I cancel my subscription?" and "Our subscription plans offer flexibility" are semantically similar but answering with the second chunk would be wrong.

  3. No keyword matching. Vector search is great for semantic similarity but terrible for exact matches. If a user asks about "Policy 4.2.1", vector search might return chunks about "policy guidelines" instead of the specific numbered policy.

  4. The top-K trap. Retrieving the top 5 chunks sounds reasonable until you realize that for complex queries, the answer might span 15 different chunks across 3 documents.
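To make those failure modes concrete, the entire naive pattern fits in a few lines -- which is exactly why it demos so well. Here is a toy, dependency-free sketch (the bag-of-words `embed` is a stand-in for a real embedding model, not anything you would ship):

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' -- a stand-in for a real model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def naive_rag_retrieve(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Embed -> similarity search -> top-K chunks. No context, no keyword index."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

chunks = [
    "Our subscription plans offer flexibility for every team.",
    "To cancel a subscription, open Billing and click Cancel.",
    "Policy 4.2.1 covers refunds for annual plans.",
]
print(naive_rag_retrieve("How do I cancel my subscription?", chunks, top_k=1))
# → ['To cancel a subscription, open Billing and click Cancel.']
```

Swap the toy embedding for a real model and the top-K list for a vector database and you have 2023-era RAG -- along with every weakness listed above.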

Advanced RAG (2024): Patches on a Broken Foundation

The industry responded with a grab bag of improvements: re-ranking, query expansion, hypothetical document embeddings (HyDE), metadata filtering, parent-child chunk retrieval. Each helped individually, but they were patches on a fundamentally flawed architecture.

Contextual Retrieval (2025-2026): Rethinking from First Principles

What we do now is fundamentally different. Instead of treating retrieval as a single step, we treat it as a multi-stage pipeline where each stage refines the results. The key insight is that the chunk itself should carry enough context to be useful without seeing the full document.

Chunking Strategies: Where 90% of RAG Systems Go Wrong

I cannot overstate this: your chunking strategy determines your RAG system's ceiling. No amount of fancy retrieval or re-ranking will fix poorly chunked data.

Fixed-Size Chunking: The Default That Should Not Be

Most tutorials show you this:

# DON'T DO THIS IN PRODUCTION
def fixed_chunk(text: str, chunk_size: int = 500) -> list[str]:
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks

This is like cutting a book into pages by counting words. You will split sentences in half, break paragraphs in the middle of a thought, and separate headers from their content.

Accuracy with fixed chunking: In our benchmarks across 5 client datasets, fixed chunking delivered 52-61% accuracy on question-answering tasks. Not good enough for anything that matters.

Recursive Chunking: Better, but Still Context-Blind

Recursive chunking respects document structure by splitting on paragraphs first, then sentences, then words:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_text(document_text)

Accuracy with recursive chunking: 65-72% across the same datasets. Better, but still losing important context.

Semantic Chunking: Grouping by Meaning

Semantic chunking uses embeddings to determine where to split. The idea is that you should split at points where the meaning changes significantly:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunk(
    sentences: list[str],
    threshold: float = 0.3
) -> list[str]:
    """Split at points where semantic similarity drops."""
    embeddings = model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        # Cosine similarity between adjacent sentences
        sim = np.dot(embeddings[i-1], embeddings[i]) / (
            np.linalg.norm(embeddings[i-1]) *
            np.linalg.norm(embeddings[i])
        )

        if sim < threshold:
            # Significant meaning shift -- start new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

Accuracy with semantic chunking: 74-80%. Now we are getting somewhere, but we can do much better.

Contextual Chunking: The State of the Art

This is what we actually use in production. The idea, originally outlined by Anthropic, is simple but powerful: before embedding a chunk, prepend context about where it came from.

import anthropic

client = anthropic.Anthropic()

def add_context_to_chunk(
    chunk: str,
    full_document: str,
    document_title: str
) -> str:
    """Use an LLM to generate context for each chunk."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Here is a document titled "{document_title}":

<document>
{full_document[:10000]}
</document>

Here is a chunk from that document:

<chunk>
{chunk}
</chunk>

Please provide a short (2-3 sentence) context that explains
where this chunk fits within the document and what key topic
it covers. This context will be prepended to the chunk for
search indexing. Be specific and include any relevant section
names or topic identifiers."""
        }]
    )

    context = response.content[0].text
    return f"CONTEXT: {context}\n\n{chunk}"


def contextual_chunk_pipeline(
    document: str,
    title: str
) -> list[str]:
    """Full contextual chunking pipeline."""
    # Step 1: Semantic chunking for natural boundaries
    # split_into_sentences: any sentence tokenizer, e.g. nltk.sent_tokenize
    sentences = split_into_sentences(document)
    raw_chunks = semantic_chunk(sentences, threshold=0.35)

    # Step 2: Add context to each chunk
    contextualized = []
    for chunk in raw_chunks:
        enriched = add_context_to_chunk(chunk, document, title)
        contextualized.append(enriched)

    return contextualized

Accuracy with contextual chunking: 85-91%. But we are not done yet.

Chunking Strategy Comparison

| Strategy | Accuracy | Latency (indexing) | Cost (per 1000 docs) | Complexity |
|---|---|---|---|---|
| Fixed (500 tokens) | 52-61% | Fast (seconds) | ~$0.00 | Trivial |
| Recursive (1000 + overlap) | 65-72% | Fast (seconds) | ~$0.00 | Low |
| Semantic | 74-80% | Moderate (minutes) | ~$0.50 (embedding) | Medium |
| Contextual | 85-91% | Slow (10-30 min) | ~$15.00 (LLM calls) | High |

The tradeoff is clear: contextual chunking costs more upfront but the accuracy improvement is worth it for any system where wrong answers have consequences. We pay the indexing cost once and get better retrieval on every query.

Hybrid Search: The Secret Weapon

Here is a fact that surprises most people: combining vector search with old-fashioned keyword search outperforms either one alone by 10-15 percentage points.

The reason is that vector search and keyword search fail in complementary ways:

  • Vector search misses exact terms (policy numbers, product names, error codes)
  • Keyword search (BM25) misses semantic meaning (synonym matching, conceptual similarity)
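If you want to see what the keyword side is actually doing, here is a minimal Okapi BM25 scorer. This is a from-scratch sketch for illustration only -- in production the sparse index lives inside the vector database or a search library:

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Keep dotted identifiers like "4.2.1" intact -- exact terms matter here
    return re.findall(r"\w+(?:\.\w+)*", text.lower())

class BM25:
    """Minimal Okapi BM25 scorer over an in-memory document list."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in self.doc_tokens]
        self.avg_len = sum(self.doc_len) / len(docs)
        self.tf = [Counter(t) for t in self.doc_tokens]
        # Document frequency -> inverse document frequency per term
        df = Counter(term for t in self.doc_tokens for term in set(t))
        n = len(docs)
        self.idf = {
            term: math.log((n - f + 0.5) / (f + 0.5) + 1)
            for term, f in df.items()
        }

    def score(self, query: str, doc_idx: int) -> float:
        score = 0.0
        for term in tokenize(query):
            if term not in self.tf[doc_idx]:
                continue
            f = self.tf[doc_idx][term]
            # Length normalization: long documents don't win by volume
            norm = 1 - self.b + self.b * self.doc_len[doc_idx] / self.avg_len
            score += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * norm)
        return score

docs = [
    "Policy 4.2.1 sets the refund window at 30 days.",
    "Our policy guidelines describe the general principles.",
]
bm25 = BM25(docs)
print(bm25.score("Policy 4.2.1", 0) > bm25.score("Policy 4.2.1", 1))  # True
```

Note how the exact token "4.2.1" pulls the first document to the top -- precisely the query type where vector search drifts toward generic "policy guidelines" chunks.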

The Architecture

from qdrant_client import QdrantClient
from qdrant_client.models import Prefetch, FusionQuery, Fusion

class HybridSearchEngine:
    def __init__(self, collection_name: str):
        self.client = QdrantClient(url="http://localhost:6333")
        self.collection = collection_name

    def search(
        self,
        query: str,
        top_k: int = 20
    ) -> list[dict]:
        """Hybrid search combining dense + sparse vectors."""

        # Generate dense embedding for semantic search
        dense_vector = embed_query(query)

        # Generate sparse vector for keyword search
        # (bm25_encode returns a models.SparseVector)
        sparse_vector = bm25_encode(query)

        # Reciprocal Rank Fusion combines both result sets
        results = self.client.query_points(
            collection_name=self.collection,
            prefetch=[
                # Semantic search
                Prefetch(
                    query=dense_vector,
                    using="dense",
                    limit=top_k
                ),
                # Keyword search
                Prefetch(
                    query=sparse_vector,
                    using="bm25",
                    limit=top_k
                ),
            ],
            # Fuse results using RRF
            query=FusionQuery(fusion=Fusion.RRF),
            limit=top_k
        )

        return [
            {
                "text": hit.payload["text"],
                "score": hit.score,
                "source": hit.payload["source"],
                "chunk_id": hit.id
            }
            for hit in results.points
        ]

Why Reciprocal Rank Fusion (RRF) Works

RRF is elegant. Instead of trying to normalize and combine scores from two different ranking systems (which is mathematically messy), it combines rankings:

RRF_score(d) = sum over ranking systems i of 1 / (k + rank_i(d))

Where k is a smoothing constant (typically 60) and rank_i(d) is the document's position in result set i. Documents missing from a result set simply contribute nothing for that set.

A document that ranks #1 in both systems gets a high combined score. A document that ranks #1 in one system but does not appear in the other still gets credit for its strong single ranking. It is simple, robust, and works surprisingly well.
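The whole algorithm fits in a few lines. Here is a sketch that fuses ranked lists of document IDs (the helper name `rrf_fuse` and the toy IDs are ours):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Reciprocal Rank Fusion: combine ranked ID lists from multiple
    retrievers without having to normalize their raw scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

dense = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
sparse = ["doc_b", "doc_d", "doc_a"]  # keyword ranking
print(rrf_fuse([dense, sparse]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

doc_b ranks #2 and #1 and wins; doc_d appears in only one list but still beats doc_c. That graceful handling of partial overlap is the point.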

Benchmark results: On our client datasets, hybrid search with RRF delivered 88-94% accuracy, compared to 82-88% for vector-only and 65-72% for keyword-only.

Reranking: The Final 3-5% That Matters

After retrieval, you have your candidate chunks. But the initial ranking (from hybrid search) is approximate. A dedicated reranking model can reorder those chunks for significantly better relevance.

The Approaches

Cohere Rerank: API-based, easy to integrate, good performance. We use this for most projects.

import cohere

co = cohere.ClientV2(api_key="your-key")

def rerank_results(
    query: str,
    documents: list[str],
    top_n: int = 5
) -> list[dict]:
    """Rerank retrieved documents for better relevance."""

    response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=documents,
        top_n=top_n,
        return_documents=True
    )

    return [
        {
            "text": result.document.text,
            "relevance_score": result.relevance_score,
            "original_index": result.index
        }
        for result in response.results
    ]

Cross-encoders: Self-hosted, slower but no API dependency. Good for sensitive data.

ColBERT: Token-level matching gives excellent accuracy for long documents. More complex to set up but we use it for legal and compliance use cases where every word matters.

Reranking Performance Comparison

| Reranker | Accuracy Boost | Latency Added | Cost per 1K queries | Best For |
|---|---|---|---|---|
| Cohere Rerank v3.5 | +3-5% | 50-100ms | $1.00 | General use |
| Cross-encoder (bge-reranker) | +3-4% | 100-200ms | $0.00 (self-hosted) | Sensitive data |
| ColBERT v2 | +4-6% | 150-300ms | $0.00 (self-hosted) | Legal, compliance |
| No reranking | Baseline | 0ms | $0.00 | Cost-sensitive |

Our recommendation: Always use reranking unless you are extremely cost-sensitive. The accuracy improvement is worth the 50-100ms of added latency and the minimal cost.

The "Just Use a Bigger Context Window" Argument

I hear this constantly: "Why bother with RAG? Just stuff everything into Claude's 200K context window."

Let me be specific about when this works and when it does not.

When Large Context Windows Work

  • Small knowledge bases (under 100 pages). If your entire knowledge base fits in the context window with room to spare, RAG adds complexity for no benefit.
  • Single-document Q&A. Analyzing one long document? Just pass it in. Claude handles 200K tokens well.
  • Exploratory questions. "What themes appear across these documents?" benefits from seeing everything at once.

When Large Context Windows Fail

  • Large knowledge bases (1000+ documents). You simply cannot fit everything in one prompt. You need retrieval.
  • Precision matters. Even with a 200K context window, models perform worse at finding specific facts buried in long contexts ("needle in a haystack"). RAG with good retrieval is more precise.
  • Cost. Sending 200K tokens per query at Claude Sonnet pricing costs about $0.60 per query for input tokens alone. If you are handling 10,000 queries per day, that is $6,000/day just for input tokens. RAG with small focused chunks cuts that per-query cost by roughly 40x.
  • Latency. Processing 200K tokens takes 5-15 seconds. RAG with small chunks responds in 1-3 seconds.

The Cost Math

| Approach | Tokens per Query | Cost per Query | Daily Cost (10K queries) |
|---|---|---|---|
| Full context (200K) | 200,000 | $0.60 | $6,000 |
| RAG (top 5 chunks, ~5K tokens) | 5,000 | $0.015 | $150 |
| RAG + Reranking | 5,500 | $0.018 | $180 |

The 40x cost difference is why RAG is not going anywhere, even as context windows grow.
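The arithmetic is simple enough to sanity-check yourself (the $3 per million input tokens figure is our assumption for Sonnet-class pricing; check current rates):

```python
def query_cost_usd(input_tokens: int, price_per_million: float = 3.00) -> float:
    """Input-token cost per query at an assumed $/M-token rate."""
    return input_tokens / 1_000_000 * price_per_million

full_context = query_cost_usd(200_000)  # ~$0.60
rag = query_cost_usd(5_000)             # ~$0.015
print(f"full context: ${full_context:.3f}, RAG: ${rag:.3f}, "
      f"ratio: {full_context / rag:.0f}x")
```

Output tokens, caching discounts, and reranking fees shift the exact numbers, but not the order of magnitude.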

Our Production RAG Stack at CODERCOPS

Here is the exact stack we deploy for client projects in 2026:

The Pipeline

Document Ingestion:
  1. Document parsing (Unstructured.io for PDFs,
     Docling for complex layouts)
  2. Semantic chunking with sentence-transformers
  3. Contextual enrichment with Claude Haiku
  4. Dual embedding: dense (OpenAI text-embedding-3-large)
     + sparse (BM25 via SPLADE)
  5. Index into Qdrant (self-hosted or Qdrant Cloud)

Query Pipeline:
  1. Query analysis (intent classification,
     entity extraction)
  2. Query expansion (generate 2-3 variant phrasings)
  3. Hybrid search (dense + sparse, RRF fusion)
  4. Reranking (Cohere Rerank v3.5)
  5. Context assembly (deduplicate, order by document
     position)
  6. Generation (Claude Sonnet with structured prompts)
  7. Citation verification (check claims against
     source chunks)
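Step 5, context assembly, is the least glamorous part of the pipeline, so here is roughly what it does. This is a simplified sketch; the chunk dict keys (`text`, `source`, `position`) are illustrative:

```python
def assemble_context(chunks: list[dict], max_chunks: int = 8) -> str:
    """Deduplicate retrieved chunks, keep the most relevant, then
    re-order by (source document, position) so the LLM reads them
    in their natural document order."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:  # chunks arrive ranked by relevance
        if chunk["text"] not in seen:
            seen.add(chunk["text"])
            unique.append(chunk)
    kept = unique[:max_chunks]
    kept.sort(key=lambda c: (c["source"], c["position"]))
    return "\n\n".join(
        f"[{c['source']} #{c['position']}]\n{c['text']}" for c in kept
    )
```

Re-ordering by document position matters more than it looks: two adjacent chunks read in sequence preserve an argument that the same chunks in relevance order would scramble.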

Query Analysis: The Underrated Step

Most RAG systems take the user's query as-is and search for it. This is a mistake. We add a query analysis step that dramatically improves retrieval:

async def analyze_query(query: str) -> dict:
    """Understand the query before searching."""
    # call_llm: your LLM wrapper, assumed to return parsed JSON

    analysis = await call_llm(f"""Analyze this search query:

Query: "{query}"

Return JSON with:
- intent: "factual", "comparison", "procedural",
  "exploratory", or "troubleshooting"
- entities: specific names, numbers, codes mentioned
- keywords: important terms for keyword search
- expanded_queries: 2-3 alternative phrasings that might
  match relevant documents
- filters: any metadata filters
  (date ranges, document types, categories)
""")

    return analysis

# Example:
# Query: "What's our refund policy for enterprise customers
#          who signed after January 2025?"
#
# Analysis:
# {
#   "intent": "factual",
#   "entities": ["enterprise", "January 2025"],
#   "keywords": ["refund policy", "enterprise customers"],
#   "expanded_queries": [
#     "enterprise refund policy terms",
#     "refund conditions for enterprise tier",
#     "enterprise customer agreement refund clause"
#   ],
#   "filters": {
#     "document_type": ["policy", "agreement"],
#     "date_after": "2025-01-01"
#   }
# }

We run the expanded queries in parallel and merge the results before reranking. This catches documents that the original query wording would have missed.
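That merge step is simple but easy to get wrong: each chunk should appear exactly once, at its best score, before the reranker sees it. A sketch (the hit dict keys `chunk_id` and `score` are illustrative):

```python
def merge_query_results(result_sets: list[list[dict]]) -> list[dict]:
    """Merge hits from the original query and its expanded variants,
    keeping each chunk once at its best score."""
    best: dict[str, dict] = {}
    for hits in result_sets:
        for hit in hits:
            current = best.get(hit["chunk_id"])
            if current is None or hit["score"] > current["score"]:
                best[hit["chunk_id"]] = hit
    # Rough pre-ranking by best score; the reranker does the real ordering
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)
```

Deduplicating before reranking also keeps the reranker bill down, since you pay per candidate document.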

Citation Verification: Catching Hallucinations

This is the step most teams skip, and it is why their RAG systems occasionally produce confident but wrong answers. After the LLM generates a response, we verify every claim against the source chunks:

async def verify_citations(
    response: str,
    source_chunks: list[str]
) -> dict:
    """Verify that claims in the response are
    supported by sources."""

    verification = await call_llm(f"""
You are a fact-checker. Compare the response against the
source documents.

RESPONSE:
{response}

SOURCE DOCUMENTS:
{format_sources(source_chunks)}

For each factual claim in the response:
1. Is it directly supported by a source document?
   (SUPPORTED / UNSUPPORTED / PARTIALLY_SUPPORTED)
2. If supported, which source document?
3. If unsupported, is it a reasonable inference or a
   potential hallucination?

Return a JSON list of claims with their verification status.
""")

    # If any claims are UNSUPPORTED, flag for review
    # or regenerate
    unsupported = [
        c for c in verification["claims"]
        if c["status"] == "UNSUPPORTED"
    ]

    if unsupported:
        # Regenerate with stricter instructions
        return await regenerate_without_hallucinations(
            response, source_chunks, unsupported
        )

    return {"response": response, "verified": True}

This verification step catches 8-12% of responses that contain hallucinated facts. The cost is an extra LLM call per query (~$0.003), which is negligible compared to the cost of serving wrong information.

Real Benchmarks From Our Client Projects

Here are actual numbers from three different RAG systems we built:

Client A: E-commerce Product Support (15K product pages)

| Configuration | Accuracy | Latency (p95) | Cost/Query |
|---|---|---|---|
| Naive RAG (fixed chunks, vector only) | 58% | 1.2s | $0.008 |
| + Recursive chunking | 67% | 1.3s | $0.008 |
| + Hybrid search | 78% | 1.5s | $0.010 |
| + Reranking | 83% | 1.8s | $0.012 |
| + Contextual chunking | 89% | 1.9s | $0.014 |
| + Query expansion + citation check | 93% | 2.4s | $0.019 |

Client B

| Configuration | Accuracy | Latency (p95) | Cost/Query |
|---|---|---|---|
| Naive RAG | 45% | 1.4s | $0.010 |
| Full pipeline (contextual + hybrid + ColBERT reranker) | 91% | 2.8s | $0.025 |

Client C: Internal Knowledge Base (500 wiki pages)

| Configuration | Accuracy | Latency (p95) | Cost/Query |
|---|---|---|---|
| Naive RAG | 62% | 1.1s | $0.007 |
| Full pipeline | 94% | 2.1s | $0.016 |
| Full context (no RAG, 200K window) | 88% | 8.5s | $0.580 |

The Client C comparison is telling: Full context without RAG actually performed worse than our full RAG pipeline, was 4x slower, and cost 36x more per query. The context window approach struggles with precision even when all the content fits.

When to Skip RAG Entirely

RAG is not always the answer. Here are situations where we actively recommend against it:

1. Small, stable knowledge bases (under 50 pages). Just put it in the system prompt. Update it when the content changes. Simple, fast, cheap.

2. Single-document analysis. If users are asking questions about one document at a time, pass the whole document in the context. No retrieval needed.

3. The knowledge is already in the model. If you are building a coding assistant that answers general programming questions, the model already knows the answer. RAG would add latency for no benefit.

4. Real-time data needs. If users need stock prices, weather, or live data, RAG over a document store is the wrong pattern. You need API integrations, not retrieval.

Our decision framework:

Do you have proprietary documents the model hasn't seen?
  ├── No → Skip RAG. Use prompting or fine-tuning.
  └── Yes →
      How much content?
        ├── Under 50 pages → System prompt injection
        ├── 50-500 pages → RAG or long context (compare costs)
        └── 500+ pages → RAG is your only option
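If you prefer the framework as code, here it is as a function (the page thresholds are our rough cutoffs from this post, not hard rules):

```python
def retrieval_recommendation(has_proprietary_docs: bool, pages: int) -> str:
    """Decision framework: which retrieval approach fits a knowledge base."""
    if not has_proprietary_docs:
        return "skip RAG: use prompting or fine-tuning"
    if pages < 50:
        return "system prompt injection"
    if pages <= 500:
        return "RAG or long context: compare costs"
    return "RAG"

print(retrieval_recommendation(True, 15_000))  # → RAG
```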

Common Mistakes We See

After auditing a dozen RAG systems for clients, these are the mistakes we see repeatedly:

1. Not evaluating retrieval separately from generation. You need to know: is the problem that you are retrieving the wrong chunks, or that the LLM is misinterpreting the right chunks? These require completely different fixes.

2. Using the same embedding model for all content types. Code, legal text, conversational text, and technical documentation all embed differently. At minimum, test your embedding model on your actual data before committing to it.

3. Ignoring metadata. Document titles, dates, authors, categories -- this metadata is gold for filtering. A query about "2025 Q3 revenue" should filter by date before doing any semantic search.

4. Not updating the index. Your knowledge base changes. Documents get updated, new ones get added, old ones become obsolete. You need an incremental indexing pipeline, not a one-time batch job.

5. Skipping evaluation. You need a test set of questions with known correct answers. Without this, you are flying blind. We build evaluation sets of 100-200 questions for every RAG system and run accuracy benchmarks on every pipeline change.
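A minimal version of such an evaluation harness is just a loop. This is a sketch: the `retrieve` callable and the eval item keys are placeholders for your own pipeline, and substring matching is the crudest possible grading (real harnesses use an LLM judge or exact-answer matching):

```python
def evaluate_retrieval(eval_set: list[dict], retrieve, top_k: int = 5) -> float:
    """Fraction of questions where a chunk containing the known answer
    appears in the top-K retrieved results.

    eval_set items: {"question": str, "answer": str}
    retrieve(question, top_k) -> list of chunk strings
    """
    hits = 0
    for item in eval_set:
        results = retrieve(item["question"], top_k)
        if any(item["answer"].lower() in r.lower() for r in results):
            hits += 1
    return hits / len(eval_set)
```

Run it on every pipeline change. A retrieval-accuracy number that moves when you swap chunkers is worth more than any amount of eyeballing.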

Building Your Own Contextual Retrieval Pipeline

If you want to implement what we have described, here is the order I would suggest:

  1. Start with recursive chunking + vector search. Get the basics working.
  2. Add BM25 hybrid search. This is the single biggest improvement for the effort.
  3. Add reranking. Cohere Rerank is the easiest to integrate.
  4. Build an evaluation set. 100+ questions with known answers from your documents.
  5. Add contextual chunking. Now you can measure whether the extra indexing cost is worth it for your data.
  6. Add query expansion. Test whether it improves accuracy on your evaluation set.
  7. Add citation verification. Essential for any user-facing system.

Each step should be measured against your evaluation set. If a step does not improve accuracy on your data, skip it. Our pipeline is the result of iterative improvement on specific client datasets, not a one-size-fits-all prescription.

What Is Next for RAG

Looking ahead, I see three trends:

1. Agentic RAG. Instead of a fixed pipeline, an AI agent decides how to retrieve information -- choosing between different search strategies, deciding whether to do multi-hop retrieval, and adapting based on what it finds. We are already building these for complex research use cases.

2. Graph RAG. Adding knowledge graph structures on top of vector search for better relationship-aware retrieval. Microsoft's GraphRAG work is promising, and we are experimenting with it for enterprise clients with highly interconnected data.

3. Multimodal RAG. Retrieving not just text chunks but images, tables, and diagrams. This is critical for industries like manufacturing, healthcare, and engineering where visual information matters as much as text.

Ready to Fix Your RAG System?

If you have a RAG system that is underperforming, or you are planning to build one and want to get it right the first time, CODERCOPS can help. We have built contextual retrieval pipelines for e-commerce, legal, healthcare, and fintech clients -- each with measurably better accuracy than the naive RAG systems they replaced.

Reach out to us for a free assessment of your current retrieval architecture. We will tell you exactly where your pipeline is losing accuracy and what changes will have the biggest impact.

For more on building production AI systems, check out our other engineering deep dives.
