Appify Intelligence - AI Development & Automation Specialists

AI Engineering

RAG is not a search bar: five mistakes we keep seeing in SME retrieval

Most SME RAG systems break at the boring edges, not the vector database. Five mistakes we keep finding in other teams' retrieval, and the cheap fixes.

Appify Intelligence Team | 22 April 2026 | 9 minute read

[Image: rows of server racks in a blue-lit data centre with cabling and status LEDs]

Retrieval-Augmented Generation has become the default pattern for anything that involves "ask questions over our documents". The reference architecture is a one-page diagram: chunk, embed, store, top-k cosine, prompt. A weekend's work for a senior engineer.

We also keep getting asked to fix RAG builds that followed exactly that recipe and are returning confident nonsense. When we open the hood, the vector database is almost never the problem. The problem is a handful of boring decisions made at the edges - how content is indexed, whether retrieval is ever measured, and what gets sent to the model. None of those fixes require rearchitecture. Most of them are a sprint.

This post is for product and technical owners whose retrieval is live and underperforming, and whose team is already reaching for a different database. It is the list of mistakes we see most often, the evidence each one has in the literature, and the smallest thing you can ship to diagnose it.

1. Context-less chunking

The textbook recipe is to split documents into fixed-size pieces of a few hundred tokens and embed each piece independently. Anthropic's own Contextual Retrieval guidance uses chunks of that size too, so the size itself is not the problem. The problem is that the chunk is indexed without the surrounding context it needs to be individually meaningful.

Their example is sharp. A chunk that reads "The company's revenue grew by 3% over the previous quarter." is mathematically retrievable but contextually useless - which company, which quarter. In a corpus of 10,000 filings, that embedding collides with every other company's every other quarter. Cosine similarity is high. Precision is zero.

Anthropic's fix is to prepend a 50-100 token chunk-specific context to each chunk before embedding it - essentially, a one-sentence answer to "where does this chunk sit in the document?". On their benchmark, contextual embeddings plus contextual BM25 reduced the top-20 retrieval failure rate by 49%. With a reranker on top, 67%. Those are not tuning-knob numbers. They are the difference between a shipping product and a rearchitecture meeting.

The cheap version of this fix is one LLM call per chunk at index time. At Anthropic's published cache-hit price of $0.30 per million tokens for Claude Sonnet - the 90% prompt-caching discount - indexing a ten-thousand-chunk corpus costs a few dollars. It is the cheapest 49% accuracy gain you will find in any AI system, and most of the teams we pick up from have never heard of it.
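The index-time step fits in a few lines. This is an illustrative sketch, not Anthropic's code: `contextualise_chunk` and the stand-in `fake_llm` are our own names, and a real build would route the call through the Anthropic API with prompt caching on the shared document prefix.

```python
def contextualise_chunk(document: str, chunk: str, llm) -> str:
    """Prepend a short, document-aware context line to a chunk
    before it is embedded, per the Contextual Retrieval recipe."""
    prompt = (
        f"Here is a document:\n{document}\n\n"
        f"Here is a chunk from it:\n{chunk}\n\n"
        "In 50-100 tokens, situate this chunk within the document. "
        "Answer with the context and nothing else."
    )
    context = llm(prompt)  # one cacheable LLM call per chunk, at index time
    return context.strip() + "\n\n" + chunk

# Stand-in LLM so the sketch runs; replace with a real API call.
fake_llm = lambda prompt: "From ACME Corp's Q2 2023 SEC filing, revenue section."

doc = "ACME Corp SEC filing, Q2 2023. ... The company's revenue grew by 3% ..."
indexed = contextualise_chunk(
    doc, "The company's revenue grew by 3% over the previous quarter.", fake_llm
)
# `indexed`, not the bare chunk, is what goes to the embedding model.
```

The same contextualised text should also feed the BM25 index - Anthropic's 49% figure comes from doing both.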

2. Single-vector retrieval with no reranking

High cosine similarity is not relevance. It is a measure of embedding proximity, and embedding models are trained to cluster semantically related text. They do not know what your user actually asked.

The symptom is familiar. A user asks "what was our revenue in Q3?" and the system returns the top five chunks where the embedding is closest to the query embedding. Three of them are about revenue - but for Q2, Q4, and a competitor. One is the quarter-over-quarter comparison. Only one is the answer. The model, given the five, hedges or averages or hallucinates.

The fix is a reranker: a second-stage model that takes the top-N retrieved chunks and scores each one against the query directly, not in embedding space. Cohere's Rerank, voyage-rerank, and cross-encoders from the open-source world all do this. Anthropic's own Contextual Retrieval benchmark attributes the jump from 49% to 67% failure reduction specifically to the reranking stage.

This is the second-cheapest fix on the list. It is a single additional API call per query. It adds a few hundred milliseconds of latency. It usually removes the biggest class of user-visible wrong answers.
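The shape of the two-stage pipeline is simple enough to sketch. The scorer below is a toy we made up - it counts shared query terms - standing in for a real reranker such as Cohere Rerank or an open-source cross-encoder; everything else is the pattern as described.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Second stage: score each retrieved chunk directly against the
    query and keep the best top_k. score_fn is the reranker."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_k]

# Toy scorer standing in for a real reranker model: counts the
# query terms a chunk shares. A cross-encoder would do far better.
def overlap_score(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))

# The top-N chunks that first-stage vector search returned.
candidates = [
    "Q2 revenue rose on strong subscriptions",
    "Q3 revenue was 4.2m, up 3% on Q2",
    "Competitor revenue guidance for Q3",
]
best = rerank("what was our revenue in Q3", candidates, overlap_score, top_k=1)
# best[0] is the Q3 chunk: it matches the query, not just the topic
```

Note that the first stage still matters - the reranker can only promote chunks the retriever surfaced.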

3. No retrieval evaluation harness

The most common thing we find is that the team has never measured retrieval quality. There is no test set of questions with known-correct source chunks, no precision@k number tracked over time, no regression test when the embedding model is swapped or the chunker is retuned.

This is not a niche methodological concern. It is the reason nobody on the team can tell you whether yesterday's "small change to the chunker" made things better or worse. It is the reason the product owner and the engineer disagree about whether retrieval is working - they are both guessing from vibes.

A first evaluation harness is genuinely small. Twenty to fifty questions, each tagged with the document sections that should be retrieved to answer it correctly. A script that runs retrieval for each, measures whether the ground-truth chunk appears in the top-k, and outputs a single recall number. Total engineering effort: two days, including writing the evaluation questions.
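The whole harness is roughly this. The question set and the `retrieve` stub are placeholders for your own pipeline and your own twenty-to-fifty questions; the metric itself is standard recall@k.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """eval_set: list of (question, ground_truth_chunk_ids) pairs.
    retrieve: the pipeline under test, returning ranked chunk ids.
    Returns the fraction of questions whose ground truth appears
    in the top-k retrieved chunks."""
    hits = sum(
        1 for question, truth in eval_set
        if set(retrieve(question)[:k]) & set(truth)
    )
    return hits / len(eval_set)

# Two of the questions you would actually write, tagged with sources.
eval_set = [
    ("what was Q3 revenue", ["chunk_012"]),
    ("who is the data controller", ["chunk_077"]),
]

# Stand-in for the real retrieval pipeline.
def retrieve(question):
    return ["chunk_012", "chunk_004", "chunk_099"]

print(f"recall@5: {recall_at_k(eval_set, retrieve):.2f}")  # recall@5: 0.50
```

Run it in CI on every chunker or embedding change and the "did that help?" argument ends.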

Once you have the harness, every embedding-model swap, every chunker change, every reranker experiment becomes an A/B test with a number attached. Without it, you are making changes in the dark and hoping the demo answers look more impressive to whoever is in the room. We have not yet met a shipping RAG system that had a properly maintained eval harness before we were called in. That gap is almost definitional.

4. Pure vector search without hybrid BM25 or metadata filters

Dense embeddings are strong at paraphrased intent. They are weak at rare words. An SME's corpus is full of rare words: product codes, regulatory citations, drug names, court case references, SKU numbers, client acronyms. Those are exactly the terms a user types when they know what they want.

The foundational benchmark here is MTEB, which found that no single embedding method dominates across tasks. A model strong at semantic textual similarity may underperform on retrieval over domain-specific vocabulary. The practical implication for SME retrieval is that pure dense search will systematically miss queries that contain exactly the identifiers your domain uses.

The fix is hybrid retrieval: run both a BM25 lexical search and a dense vector search, then fuse the rankings. Reciprocal Rank Fusion is the standard combiner and works without tuning. On top of that, metadata filters - customer_id, date_range, document_type - cut the candidate set before either search runs. In a multi-tenant product this is not optional; it is how you stop tenant A's retrieval from ever seeing tenant B's documents in the first place.
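Reciprocal Rank Fusion itself is a dozen lines. The constant k=60 is the value from the original RRF paper; the example chunk ids are invented to show the shape of the fusion.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one.

    rankings: a list of ranked id lists, best first.
    k: the damping constant from the original RRF paper (60).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 nails the exact SKU; dense search surfaces the paraphrases.
bm25_hits = ["chunk_sku_991", "chunk_faq", "chunk_price_list"]
dense_hits = ["chunk_sku_991", "chunk_faq", "chunk_returns"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# chunk_sku_991 tops the fused list: both retrievers rank it first
```

Chunks both retrievers agree on rise to the top, with no score calibration between the two systems needed - which is exactly why RRF works without tuning.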

We built a news-clustering pipeline for a publisher last year that ingests 20-30 RSS feeds a day and groups related stories using pgvector embeddings. It works well. But for the subset of queries that include named entities - politicians, company names, product launches - we still lean on keyword filters first. Dense embeddings alone put related-but-different stories into the same cluster and the editors flagged it immediately.

5. Reaching for the 1M-token context window at the wrong scale

"Claude has a million-token context. We will just paste the whole corpus" is an answer that is sometimes correct and sometimes expensive nonsense, and the difference matters.

Anthropic's own Contextual Retrieval post is explicit: if your knowledge base is smaller than 200,000 tokens, about 500 pages, skip RAG and stuff the context. With prompt caching, it is cheap and it is simpler than maintaining a retrieval pipeline. We have shipped several small-corpus features exactly this way, for clients whose "corpus" is a policy handbook, a product spec, or a single long contract.

Beyond that threshold, three things break. Cost scales linearly with input tokens even with caching. Latency scales with input tokens. And accuracy degrades with context length - the Lost in the Middle paper showed this in 2023 and Anthropic's own engineering team now frames it as an attention budget that depletes with every token, degrading recall as the context fills. Databricks' long-context RAG benchmark found most models plateau or degrade past 32k to 64k tokens. The frontier has moved since, but not to the point where "paste everything at a million tokens" is free of accuracy cost.

The right rule is simple. Under 200k tokens, try the naive paste first - it is Anthropic's own recommendation and it saves you a quarter of the engineering cost of a RAG system you do not need. Above that, retrieve.
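As a routing rule it is one function. The 4-characters-per-token estimate is a crude heuristic of ours, not a real tokenizer - count properly before trusting the boundary cases.

```python
def should_skip_rag(corpus_text: str, token_budget: int = 200_000) -> bool:
    """Under roughly 200k tokens, paste the whole corpus with prompt
    caching; above it, build retrieval. Uses a crude 4-characters-
    per-token estimate in place of a real tokenizer."""
    approx_tokens = len(corpus_text) / 4
    return approx_tokens < token_budget

handbook = "policy text " * 5_000    # ~15k tokens: paste it
archive = "filing text " * 400_000   # ~1.2M tokens: retrieve

print(should_skip_rag(handbook), should_skip_rag(archive))  # True False
```

The point is not the function - it is that the decision should be made from a token count, not from which architecture the team finds more interesting.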

The theme under all five

Small fixes, boring places

None of these are exotic fixes. Contextual chunking is an LLM call at index time. Reranking is one extra API call per query. Evaluation is a two-day harness. Hybrid retrieval is a half-day change in most stacks. Knowing when to skip RAG entirely is a reading comprehension exercise.

The pattern we find is that teams spend rearchitecture money on symptoms that the five boring fixes above would have caught. Productivity with AI is easy - it is trivial to ship a RAG chatbot. Profit is harder, and it lives in whether the retrieval is good enough that a human in the loop trusts the outputs instead of double-checking every one of them. That is what the governance side of AI actually looks like at SME scale: not a compliance document, but a precision@k number your team trusts.

If any of this sounds like the system sitting on your roadmap, the cheapest diagnostic is to write twenty evaluation questions this week and run them against what you have. The output will tell you which of the five mistakes is yours. If you would like us to do that exercise with you, our RAG systems and AI consulting practices run exactly this kind of retrieval audit for clients. Book a call - we will tell you honestly whether the fix is a sprint or a rebuild.

Tagged

rag, ai-engineering, retrieval, vector-search, evaluation

Ready to talk?

If this post maps to a problem you're hitting, we'd like to hear about it. We turn AI experiments into production systems.

Start a conversation