AI Engineering

What actually moves AI unit economics in 2026

Most AI feature budgets pick the wrong fight. Caching, retrieval and tier routing move the bill far more than model choice. Here's the architecture that wins.

For: CTOs, engineering leads, and Heads of Data running production AI features

Appify Intelligence Team | 24 April 2026 | 7 minute read
[Hero image: an old wooden abacus with shiny beads next to an accounting ledger on a rustic wooden table]

Most teams that ship a production AI feature spend the first three months model-shopping and the next three months wondering why the feature costs 40x what the demo suggested. The answer is rarely the model.

Three numbers set up the argument. First, moving within Claude's own line from the most expensive model (Opus 4.7) down to the cheapest (Haiku 4.5) cuts cost by 5x on both input and output tokens, per Anthropic's own pricing. Second, picking between top-tier frontier models across providers (Opus, GPT, Gemini) only moves cost by around 30%. Third, two teams running the same model against the same traffic routinely see a 5x to 10x cost gap between them, driven entirely by how the feature is architected. The biggest lever is architecture.

This post is for engineering leads, CTOs and heads of data who have either shipped an AI feature or are about to, and are staring at the API bill wondering where the money is going. Four levers dwarf model choice at the per-request scale. If you have not touched these, the diagnostic is straightforward: you have headroom.

1. Prompt caching is the highest-impact setting you have

Cache reads on Claude cost $0.50 per million tokens on Opus 4.7, $0.30 on Sonnet 4.6, $0.10 on Haiku 4.5. The standard input prices are $5, $3 and $1. That is a 10x discount on cache hits, and it applies to every token the model has seen before in a stable prefix.

OpenAI's prompt caching delivers up to 90% off cached input tokens and 80% lower time-to-first-token on prompts over 1,024 tokens. It is automatic. AWS Bedrock's equivalent gives roughly 90% off on Claude cache reads, and Google Gemini offers context caching at roughly 10 to 20 percent of standard input price plus a small hourly storage fee.

The economics look like this. A 10,000-token system prompt, stable across every call, costs $0.03 per request uncached on Sonnet 4.6. The cache-hit price is $0.003. A feature making 100,000 calls a month on that prefix alone goes from $3,000 to $300.
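The arithmetic is worth writing down once. A minimal sketch using the prices quoted above; the token counts and call volume are illustrative, and it ignores the small cache-write premium on the first call in each cache window.

```python
# Worked example of the cache economics above, using the article's prices.
PREFIX_TOKENS = 10_000                # stable system prompt, identical on every call
CALLS_PER_MONTH = 100_000

SONNET_INPUT_PER_MTOK = 3.00          # standard input price, $ per million tokens
SONNET_CACHE_READ_PER_MTOK = 0.30     # cache-read price, $ per million tokens

uncached = PREFIX_TOKENS / 1e6 * SONNET_INPUT_PER_MTOK * CALLS_PER_MONTH
cached = PREFIX_TOKENS / 1e6 * SONNET_CACHE_READ_PER_MTOK * CALLS_PER_MONTH

print(f"uncached: ${uncached:,.0f}/month, cached: ${cached:,.0f}/month")
# uncached: $3,000/month, cached: $300/month
```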

Why cache hit rates collapse in production

The catch is that caching only works when the prefix is byte-identical across calls. Tool definitions that change every request, system prompts that interpolate the current timestamp, JSON tool outputs with non-deterministic key ordering: each of these invalidates the prefix and silently drops cache hit rate to zero. The engineering discipline is keeping the first N tokens of every request stable. That is a half-day refactor on most codebases. Teams who have not done it are paying full price on tokens they already paid the model to read once.
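A minimal sketch of what the stable-prefix discipline looks like with Anthropic's cache_control content blocks. The model id and the two loader helpers are placeholders; the point is that the tools and system block are byte-identical on every call, and anything per-request, such as the timestamp or the user query, lives after the cache marker rather than inside the prefix.

```python
import anthropic

client = anthropic.Anthropic()

# Built once at startup; byte-identical on every request.
STABLE_SYSTEM_PROMPT = load_system_prompt()   # hypothetical helper
STABLE_TOOLS = load_tool_definitions()        # hypothetical helper; fixed key ordering

def ask(user_query: str, current_time: str):
    return client.messages.create(
        model="claude-sonnet-4-6",            # placeholder model id
        max_tokens=1024,
        tools=STABLE_TOOLS,
        system=[
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                # Marks the end of the cacheable prefix. Everything up to and
                # including this block (tools plus system) must not change
                # between calls, or the cache hit rate drops to zero.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            # Per-request material goes after the cached prefix, never inside it.
            {"role": "user", "content": f"Current time: {current_time}\n\n{user_query}"},
        ],
    )
```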

For longer-lived prefixes, Anthropic offers a 1-hour cache TTL (time-to-live) at 2x the standard write cost. Whether to opt in depends on your traffic shape. A feature handling steady QPS (queries per second) amortises the 5-minute default fine. A feature with bursty, sparse traffic is losing cache hits in the gaps between requests and should pay for the longer TTL.

2. Retrieval quality is the second lever and it compounds

The default RAG (retrieval-augmented generation) recipe of chunk, embed, top-k cosine, prompt is a flat starting point. Anthropic's Contextual Retrieval work shows why.

Their benchmark: contextual embeddings plus contextual BM25 cut the top-20 retrieval failure rate by 49%, from 5.7% to 2.9%. Adding a reranker drops the failure rate to 1.9%, a 67% total reduction. The preprocessing step, with prompt caching, costs $1.02 per million document tokens, paid once at index time.
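The preprocessing step itself is short to sketch. This assumes a generic chunker, embedder, vector index and BM25 index, plus a cheap model call per chunk; the helper names are illustrative, not a specific library's API, and the prompt is along the lines Anthropic describes in the Contextual Retrieval post.

```python
CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short, succinct context that situates this chunk within the overall
document, to improve search retrieval of the chunk."""

def build_contextual_index(document: str):
    for chunk in split_into_chunks(document):    # hypothetical chunker
        # One cheap model call per chunk. With the full document held in a
        # cached prefix, this is the ~$1 per million document tokens step.
        context = cheap_model(CONTEXT_PROMPT.format(document=document, chunk=chunk))
        contextualised = f"{context}\n\n{chunk}"
        # Index the contextualised text in both the vector store and BM25.
        vector_index.add(embed(contextualised), payload=contextualised)   # hypothetical
        bm25_index.add(contextualised, payload=contextualised)            # hypothetical
```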

We wrote about five concrete mistakes in SME retrieval earlier this week. The unit-economics angle is sharper. A RAG system with better retrieval sends shorter, more relevant context to the model, which is both cheaper per request and more likely to produce an answer a human trusts without double-checking. Precision at k flows straight into the bill: a 2x improvement in retrieval quality roughly halves the context size you need to ship, which roughly halves the input cost, which stacks with whatever caching discount applies.

You cannot spend your way out of bad retrieval

The converse is also true. Teams whose retrieval is bad respond by expanding context. Token cost rises linearly. Accuracy does not. You can spend your way out of bad retrieval up to a point. Past that point the accuracy curve stops moving and the cost curve keeps climbing.
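Measuring this does not need a benchmark harness. A precision-at-k check over a small hand-labelled set is an afternoon's work; the sketch below assumes you can list, for each test question, the chunk ids a correct answer must draw on, and that `retrieve` is your production retrieval function. The same number answers the third diagnostic question further down.

```python
def precision_at_k(retrieve, test_set, k: int = 5) -> float:
    """Fraction of retrieved chunks that are relevant, averaged over queries.

    `test_set` is a list of (question, set_of_relevant_chunk_ids) pairs
    labelled by hand; 30 questions is enough to see the trend.
    """
    scores = []
    for question, relevant_ids in test_set:
        retrieved = retrieve(question, top_k=k)
        hits = sum(1 for chunk in retrieved if chunk.id in relevant_ids)
        scores.append(hits / k)
    return sum(scores) / len(scores)
```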

3. Tier routing is where model choice actually matters

Within a tier, the choice of frontier model is a minor optimisation. Across tiers it is a 5x cost difference and a large opportunity for most production workloads.

The public evidence is RouteLLM, which showed an 85% cost reduction on MT-Bench at 95% of GPT-4 quality by routing most queries to a cheaper model and escalating the hard ones. On MMLU the reduction was 45%; on GSM8K, 35%. Benchmark-specific numbers, but the pattern holds: for many production workloads a sizeable fraction of queries do not need the top-tier model.

What the routing work looks like in practice

In practice, the routing work looks like this. Classify queries into the categories your product actually has: summarise, extract structured fields, answer factual questions, write copy, chain tool calls. Run each category against Haiku and Sonnet on a real test set. Route everything that passes the Haiku quality bar to Haiku. Route everything that needs the step up to Sonnet. Send only the residual reasoning-heavy work to Opus. The same decomposition works on OpenAI's tier ladder and on Gemini's, and across providers if your stack tolerates it.
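A minimal router is not much code. The category names, tier assignments and model ids below are placeholders; the real work is the offline evaluation that decides which categories clear the cheaper tier's quality bar.

```python
# Categories that cleared the Haiku quality bar on your offline test set.
CHEAP_TIER = {"summarise", "extract_fields", "factual_qa"}
# Categories that need the step up but not the top tier.
MID_TIER = {"write_copy", "tool_chain"}

def pick_model(query: str) -> str:
    category = classify(query)        # hypothetical: cheap model call or trained classifier
    if category in CHEAP_TIER:
        return "claude-haiku-4-5"     # placeholder id, cheapest tier
    if category in MID_TIER:
        return "claude-sonnet-4-6"    # placeholder id, middle tier
    return "claude-opus-4-7"          # residual reasoning-heavy work only
```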

This is the lever that most cleanly justifies a week of engineering time. A routing layer that moves 60% of Sonnet traffic to Haiku cuts input cost on that feature by roughly 40%, compounding with caching. The routing layer itself is a few days of work. It pays back on day one.
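The arithmetic behind that 40%, using the input prices quoted earlier: leaving everything on Sonnet costs $3 per million input tokens, while moving 60% of the traffic to Haiku gives 0.4 × $3 + 0.6 × $1 = $1.80 per million, a 40% reduction before any cache-read discount is applied.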

4. Long context is a trap unless you respect the attention budget

The frontier models advertise 1 million token context windows. Anthropic's engineering team describes context as an "attention budget" that depletes with every token: more tokens in the window means less accurate recall. Databricks' long-context RAG benchmark found frontier models hold up past 64k tokens, but older and smaller models degrade well before that. The Lost in the Middle paper documented that relevant information in the middle of a long context is systematically under-attended to. The frontier has partially closed that gap, though it has not eliminated it.

The cost curve is linear. A million-token call costs 100x a 10,000-token call on the same model. Caching helps on stable prefixes. It does not help on the variable retrieval payload that is the reason anyone ships long context in the first place.

Anthropic's own guidance, unchanged through several context-window expansions, is that if your knowledge base is under 200,000 tokens (about 500 pages), skip retrieval and stuff it. Above that, retrieve. That rule is still correct in April 2026. Below the threshold, a RAG pipeline is engineering cost for no benefit. Above the threshold, a 10,000-token retrieval plus a well-tuned reranker beats a 500,000-token stuff on cost, latency and accuracy.
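The per-call arithmetic at Sonnet's $3 per million input tokens: stuffing a 500,000-token corpus costs about $1.50 in input on every request, against roughly $0.03 for a 10,000-token retrieval payload. Even if the corpus is a stable, fully cached prefix billed at the $0.30 cache-read rate, the stuff is still around five times the retrieval, and it carries the latency and attention-budget penalties with it.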

A diagnostic you can run this week

If you want a concrete test of whether you have any of this headroom, here is the smallest useful version.

Pull your API bill for the last 30 days. For the single highest-volume AI feature, answer four questions.

First, what percentage of your input tokens are billed at the cache-read rate? If the number is under 40%, you have caching headroom. If the number is zero, caching is not on.

Second, for a sample of 50 real production queries, would Haiku's answer be acceptable on at least half? If yes, you have tier-routing headroom in the 30 to 60 percent range on that feature.

Third, if the feature uses retrieval, what is your precision at 5 on a held-out test set of 30 questions? If you do not have a test set, that is its own answer: retrieval quality is not being measured and the cost of bad retrieval is already in your bill.

Fourth, what is the median context size on a request? If it is growing over time and answer quality is flat, you are buying accuracy with tokens and the curve is working against you.

None of these are architectural changes. They are diagnostic questions. The answers tell you which of the four levers is yours to pull first.
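For the first question, the numbers are already in the API responses. A minimal sketch, assuming you log the usage object Anthropic's Messages API returns on each call (field names as the API reports them at the time of writing; other providers expose equivalents):

```python
def cache_read_share(usage_log) -> float:
    """Share of input tokens billed at the cache-read rate over a period.

    `usage_log` is an iterable of usage objects, e.g.
    {"input_tokens": ..., "cache_creation_input_tokens": ...,
     "cache_read_input_tokens": ..., "output_tokens": ...}.
    """
    cache_read = sum(u.get("cache_read_input_tokens", 0) for u in usage_log)
    total_input = sum(
        u.get("input_tokens", 0)
        + u.get("cache_creation_input_tokens", 0)
        + u.get("cache_read_input_tokens", 0)
        for u in usage_log
    )
    return cache_read / total_input if total_input else 0.0
```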

Where Appify sits on this

Our RAG and AI consulting practices run this exact exercise with mid-market teams already shipping AI. Our experience: caching discipline, retrieval quality, tier routing and context posture together carry more of the P&L case than any frontier-model comparison a team can run. Model selection matters at the tier boundary. Inside a tier, architecture is where the bill moves.

If your AI feature is live and you think its unit economics are off, book a call. We can usually name which of the four levers is your biggest lift within an hour of reading your stack.

Tagged

ai-engineering · unit-economics · prompt-caching · cost-optimization · llm-architecture

Ready to talk?

If this post maps to a problem you're hitting, we'd like to hear about it. We turn AI experiments into production systems.

Start a conversation

Related articles


AI Engineering

RAG is not a search bar: five mistakes we keep seeing in SME retrieval

Most SME RAG systems break at the boring edges, not the vector database. Five mistakes we keep finding in other teams' retrieval, and the cheap fixes.