Appify Intelligence - AI Development & Automation Specialists

AI Engineering

RAG is not a search bar: five mistakes we keep seeing in SME retrieval

Most SME RAG systems break at the boring edges, not the vector database. Five mistakes we keep finding in other teams' retrieval, and the cheap fixes.

Appify Intelligence Team | 22 April 2026 | 9 minute read

[Image: rows of server racks in a blue-lit data centre with cabling and status LEDs]

Retrieval-Augmented Generation has become the default pattern for anything that involves "ask questions over our documents". The reference architecture is a one-page diagram: chunk, embed, store, top-k cosine, prompt. A weekend's work for a senior engineer.

We also keep getting asked to fix RAG builds that followed exactly that recipe and are returning confident nonsense. When we open the hood, the vector database is almost never the problem. The problem is a handful of boring decisions made at the edges - how content is indexed, whether retrieval is ever measured, and what gets sent to the model. None of those fixes require rearchitecture. Most of them are a sprint.

This post is for product and technical owners whose retrieval is live and underperforming, and whose team is already reaching for a different database. It is the list of mistakes we see most often, the evidence each one has in the literature, and the smallest thing you can ship to diagnose it.

1. Context-less chunking

The textbook recipe is to split documents into fixed-size pieces of a few hundred tokens and embed each piece independently. Anthropic's own Contextual Retrieval guidance uses chunks of that size too, so the size itself is not the problem. The problem is that the chunk is indexed without the surrounding context it needs to be individually meaningful.

Their example is sharp. A chunk that reads "The company's revenue grew by 3% over the previous quarter." is mathematically retrievable but contextually useless - which company, which quarter. In a corpus of 10,000 filings, that embedding collides with every other company's every other quarter. Cosine similarity is high. Precision is zero.

Anthropic's fix is to prepend a 50-100 token chunk-specific context to each chunk before embedding it - essentially, a one-sentence answer to "where does this chunk sit in the document?". On their benchmark, contextual embeddings plus contextual BM25 reduced the top-20 retrieval failure rate by 49%. With a reranker on top, 67%. Those are not tuning-knob numbers. They are the difference between a shipping product and a rearchitecture meeting.

The cheap version of this fix is one LLM call per chunk at index time. At Anthropic's published cache-hit price of $0.30 per million tokens for Claude Sonnet - the 90% prompt-caching discount - indexing a ten-thousand-chunk corpus costs a few dollars. It is the cheapest 49% accuracy gain you will find in any AI system, and most of the teams we pick up from have never heard of it.
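The index-time step fits in a few lines. This is an illustrative sketch, not Anthropic's code: `contextualise_chunk` and the stand-in `fake_llm` are our own names, and a real build would route the call through the Anthropic API with prompt caching on the shared document prefix.

```python
def contextualise_chunk(document: str, chunk: str, llm) -> str:
    """Prepend a short, document-aware context line to a chunk
    before it is embedded, per the Contextual Retrieval recipe."""
    prompt = (
        f"Here is a document:\n{document}\n\n"
        f"Here is a chunk from it:\n{chunk}\n\n"
        "In 50-100 tokens, situate this chunk within the document. "
        "Answer with the context and nothing else."
    )
    context = llm(prompt)  # one cacheable LLM call per chunk, at index time
    return context.strip() + "\n\n" + chunk

# Stand-in LLM so the sketch runs; replace with a real API call.
fake_llm = lambda prompt: "From ACME Corp's Q2 2023 SEC filing, revenue section."

doc = "ACME Corp SEC filing, Q2 2023. ... The company's revenue grew by 3% ..."
indexed = contextualise_chunk(
    doc, "The company's revenue grew by 3% over the previous quarter.", fake_llm
)
# `indexed`, not the bare chunk, is what goes to the embedding model.
```

The same contextualised text should also feed the BM25 index - Anthropic's 49% figure comes from doing both.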

2. Single-vector retrieval with no reranking

High cosine similarity is not relevance. It is a measure of embedding proximity, and embedding models are trained to cluster semantically related text. They do not know what your user actually asked.

The symptom is familiar. A user asks "what was our revenue in Q3?" and the system returns the top five chunks where the embedding is closest to the query embedding. Three of them are about revenue - but for Q2, Q4, and a competitor. One is the quarter-over-quarter comparison. Only one is the answer. The model, given the five, hedges or averages or hallucinates.

The fix is a reranker: a second-stage model that takes the top-N retrieved chunks and scores each one against the query directly, not in embedding space. Cohere's Rerank, voyage-rerank, and cross-encoders from the open-source world all do this. Anthropic's own Contextual Retrieval benchmark attributes the jump from 49% to 67% failure reduction specifically to the reranking stage.

This is the second-cheapest fix on the list. It is a single additional API call per query. It adds a few hundred milliseconds of latency. It usually removes the biggest class of user-visible wrong answers.
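The shape of the two-stage pipeline is simple enough to sketch. The scorer below is a toy we made up - it counts shared query terms - standing in for a real reranker such as Cohere Rerank or an open-source cross-encoder; everything else is the pattern as described.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Second stage: score each retrieved chunk directly against the
    query and keep the best top_k. score_fn is the reranker."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_k]

# Toy scorer standing in for a real reranker model: counts the
# query terms a chunk shares. A cross-encoder would do far better.
def overlap_score(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))

# The top-N chunks that first-stage vector search returned.
candidates = [
    "Q2 revenue rose on strong subscriptions",
    "Q3 revenue was 4.2m, up 3% on Q2",
    "Competitor revenue guidance for Q3",
]
best = rerank("what was our revenue in Q3", candidates, overlap_score, top_k=1)
# best[0] is the Q3 chunk: it matches the query, not just the topic
```

Note that the first stage still matters - the reranker can only promote chunks the retriever surfaced.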

3. No retrieval evaluation harness

The most common thing we find is that the team has never measured retrieval quality. There is no test set of questions with known-correct source chunks, no precision@k number tracked over time, no regression test when the embedding model is swapped or the chunker is retuned.

This is not a niche methodological concern. It is the reason nobody on the team can tell you whether yesterday's "small change to the chunker" made things better or worse. It is the reason the product owner and the engineer disagree about whether retrieval is working - they are both guessing from vibes.

A first evaluation harness is genuinely small. Twenty to fifty questions, each tagged with the document sections that should be retrieved to answer it correctly. A script that runs retrieval for each, measures whether the ground-truth chunk appears in the top-k, and outputs a single recall number. Total engineering effort: two days, including writing the evaluation questions.
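The whole harness is roughly this. The question set and the `retrieve` stub are placeholders for your own pipeline and your own twenty-to-fifty questions; the metric itself is standard recall@k.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """eval_set: list of (question, ground_truth_chunk_ids) pairs.
    retrieve: the pipeline under test, returning ranked chunk ids.
    Returns the fraction of questions whose ground truth appears
    in the top-k retrieved chunks."""
    hits = sum(
        1 for question, truth in eval_set
        if set(retrieve(question)[:k]) & set(truth)
    )
    return hits / len(eval_set)

# Two of the questions you would actually write, tagged with sources.
eval_set = [
    ("what was Q3 revenue", ["chunk_012"]),
    ("who is the data controller", ["chunk_077"]),
]

# Stand-in for the real retrieval pipeline.
def retrieve(question):
    return ["chunk_012", "chunk_004", "chunk_099"]

print(f"recall@5: {recall_at_k(eval_set, retrieve):.2f}")  # recall@5: 0.50
```

Run it in CI on every chunker or embedding change and the "did that help?" argument ends.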

Once you have the harness, every embedding-model swap, every chunker change, every reranker experiment becomes an A/B test with a number attached. Without it, you are making changes in the dark and hoping the demo answers look more impressive to whoever is in the room. We have not yet met a shipping RAG system that had a properly maintained eval harness before we were called in. That gap is almost definitional.

4. Pure vector search without hybrid BM25 or metadata filters

Dense embeddings are strong at paraphrased intent. They are weak at rare words. An SME's corpus is full of rare words: product codes, regulatory citations, drug names, court case references, SKU numbers, client acronyms. Those are exactly the terms a user types when they know what they want.

The foundational benchmark here is MTEB, which found that no single embedding method dominates across tasks. A model strong at semantic textual similarity may underperform on retrieval over domain-specific vocabulary. The practical implication for SME retrieval is that pure dense search will systematically miss queries that contain exactly the identifiers your domain uses.

The fix is hybrid retrieval: run both a BM25 lexical search and a dense vector search, then fuse the rankings. Reciprocal Rank Fusion is the standard combiner and works without tuning. On top of that, metadata filters - customer_id, date_range, document_type - cut the candidate set before either search runs. In a multi-tenant product this is not optional; it is how you stop tenant A's retrieval from ever seeing tenant B's documents in the first place.
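Reciprocal Rank Fusion itself is a dozen lines. The constant k=60 is the value from the original RRF paper; the example chunk ids are invented to show the shape of the fusion.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one.

    rankings: a list of ranked id lists, best first.
    k: the damping constant from the original RRF paper (60).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 nails the exact SKU; dense search surfaces the paraphrases.
bm25_hits = ["chunk_sku_991", "chunk_faq", "chunk_price_list"]
dense_hits = ["chunk_sku_991", "chunk_faq", "chunk_returns"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# chunk_sku_991 tops the fused list: both retrievers rank it first
```

Chunks both retrievers agree on rise to the top, with no score calibration between the two systems needed - which is exactly why RRF works without tuning.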

We built a news-clustering pipeline for a publisher last year that ingests 20-30 RSS feeds a day and groups related stories using pgvector embeddings. It works well. But for the subset of queries that include named entities - politicians, company names, product launches - we still lean on keyword filters first. Dense embeddings alone put related-but-different stories into the same cluster and the editors flagged it immediately.

5. Reaching for the 1M-token context window at the wrong scale

"Claude has a million-token context. We will just paste the whole corpus" is an answer that is sometimes correct and sometimes expensive nonsense, and the difference matters.

Anthropic's own Contextual Retrieval post is explicit: if your knowledge base is smaller than 200,000 tokens, about 500 pages, skip RAG and stuff the context. With prompt caching, it is cheap and it is simpler than maintaining a retrieval pipeline. We have shipped several small-corpus features exactly this way, for clients whose "corpus" is a policy handbook, a product spec, or a single long contract.

Beyond that threshold, three things break. Cost scales linearly with input tokens even with caching. Latency scales with input tokens. And accuracy degrades with context length - the Lost in the Middle paper showed this in 2023 and Anthropic's own engineering team now frames it as an attention budget that depletes with every token, degrading recall as the context fills. Databricks' long-context RAG benchmark found most models plateau or degrade past 32k to 64k tokens. The frontier has moved since, but not to the point where "paste everything at a million tokens" is free of accuracy cost.

The right rule is simple. Under 200k tokens, try the naive paste first - it is Anthropic's own recommendation and it saves you a quarter of the engineering cost of a RAG system you do not need. Above that, retrieve.
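As a routing rule it is one function. The 4-characters-per-token estimate is a crude heuristic of ours, not a real tokenizer - count properly before trusting the boundary cases.

```python
def should_skip_rag(corpus_text: str, token_budget: int = 200_000) -> bool:
    """Under roughly 200k tokens, paste the whole corpus with prompt
    caching; above it, build retrieval. Uses a crude 4-characters-
    per-token estimate in place of a real tokenizer."""
    approx_tokens = len(corpus_text) / 4
    return approx_tokens < token_budget

handbook = "policy text " * 5_000    # ~15k tokens: paste it
archive = "filing text " * 400_000   # ~1.2M tokens: retrieve

print(should_skip_rag(handbook), should_skip_rag(archive))  # True False
```

The point is not the function - it is that the decision should be made from a token count, not from which architecture the team finds more interesting.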

The theme under all five

Small fixes, boring places

None of these are exotic fixes. Contextual chunking is an LLM call at index time. Reranking is one extra API call per query. Evaluation is a two-day harness. Hybrid retrieval is a half-day change in most stacks. Knowing when to skip RAG entirely is a reading comprehension exercise.

The pattern we find is that teams spend rearchitecture money on symptoms that the five boring fixes above would have caught. Productivity with AI is easy - it is trivial to ship a RAG chatbot. Profit is harder, and it lives in whether the retrieval is good enough that a human in the loop trusts the outputs instead of double-checking every one of them. That is what the governance side of AI actually looks like at SME scale: not a compliance document, but a precision@k number your team trusts.

If any of this sounds like the system sitting on your roadmap, the cheapest diagnostic is to write twenty evaluation questions this week and run them against what you have. The output will tell you which of the five mistakes is yours. If you would like us to do that exercise with you, our RAG systems and AI consulting practices run exactly this kind of retrieval audit for clients. Book a call - we will tell you honestly whether the fix is a sprint or a rebuild.

Tagged

rag, ai-engineering, retrieval, vector-search, evaluation

Ready to talk?

If this post maps to a problem you're hitting, we'd like to hear about it. We turn AI experiments into production systems.

Start a conversation