AI Engineering

What actually moves AI unit economics in 2026

Most AI feature budgets pick the wrong fight. Caching, retrieval and tier routing move the bill far more than model choice. Here's the architecture that wins.

For: CTOs, engineering leads, and Heads of Data running production AI features

Appify Intelligence Team | 24 April 2026 | 7 minute read
[Hero image: an old wooden abacus with shiny beads next to an accounting ledger on a rustic wooden table]

Most teams that ship a production AI feature spend the first three months model-shopping and the next three months wondering why the feature costs 40x what the demo suggested. The answer is rarely the model.

Three numbers set up the argument. First, moving within Claude's own line from the most expensive model (Opus 4.7) down to the cheapest (Haiku 4.5) cuts cost by 5x on both input and output tokens, per Anthropic's own pricing. Second, picking between top-tier frontier models across providers (Opus, GPT, Gemini) only moves cost by around 30%. Third, two teams running the same model against the same traffic routinely see a 5x to 10x cost gap between them, driven entirely by how the feature is architected. The biggest lever is architecture.

This post is for engineering leads, CTOs and heads of data who have either shipped an AI feature or are about to, and are staring at the API bill wondering where the money is going. Four levers dwarf model choice at the per-request scale. If you have not touched these, the diagnostic is straightforward: you have headroom.

1. Prompt caching is the highest-impact setting you have

Cache reads on Claude cost $0.50 per million tokens on Opus 4.7, $0.30 on Sonnet 4.6, $0.10 on Haiku 4.5. The standard input prices are $5, $3 and $1. That is a 10x discount on cache hits, and it applies to every token the model has seen before in a stable prefix.

OpenAI's prompt caching delivers up to 90% off cached input tokens and 80% lower time-to-first-token on prompts over 1,024 tokens. It is automatic. AWS Bedrock's equivalent gives roughly 90% off on Claude cache reads, and Google Gemini offers context caching at roughly 10 to 20 percent of standard input price plus a small hourly storage fee.

The economics look like this. A 10,000-token system prompt, stable across every call, costs $0.03 per request uncached on Sonnet 4.6. The cache-hit price is $0.003. A feature making 100,000 calls a month on that prefix alone goes from $3,000 to $300.
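The arithmetic is worth writing down once. A minimal sketch using the prices quoted above; the token counts and call volume are illustrative, and it ignores the small cache-write premium on the first call in each cache window.

```python
# Worked example of the cache economics above, using the article's prices.
PREFIX_TOKENS = 10_000                # stable system prompt, identical on every call
CALLS_PER_MONTH = 100_000

SONNET_INPUT_PER_MTOK = 3.00          # standard input price, $ per million tokens
SONNET_CACHE_READ_PER_MTOK = 0.30     # cache-read price, $ per million tokens

uncached = PREFIX_TOKENS / 1e6 * SONNET_INPUT_PER_MTOK * CALLS_PER_MONTH
cached = PREFIX_TOKENS / 1e6 * SONNET_CACHE_READ_PER_MTOK * CALLS_PER_MONTH

print(f"uncached: ${uncached:,.0f}/month, cached: ${cached:,.0f}/month")
# uncached: $3,000/month, cached: $300/month
```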

Why cache hit rates collapse in production

The catch is that caching only works when the prefix is byte-identical across calls. Tool definitions that change every request, system prompts that interpolate the current timestamp, JSON tool outputs with non-deterministic key ordering: each of these invalidates the prefix and silently drops cache hit rate to zero. The engineering discipline is keeping the first N tokens of every request stable. That is a half-day refactor on most codebases. Teams who have not done it are paying full price on tokens they already paid the model to read once.
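A minimal sketch of what the stable-prefix discipline looks like with Anthropic's cache_control content blocks. The model id and the two loader helpers are placeholders; the point is that the tools and system block are byte-identical on every call, and anything per-request, such as the timestamp or the user query, lives after the cache marker rather than inside the prefix.

```python
import anthropic

client = anthropic.Anthropic()

# Built once at startup; byte-identical on every request.
STABLE_SYSTEM_PROMPT = load_system_prompt()   # hypothetical helper
STABLE_TOOLS = load_tool_definitions()        # hypothetical helper; fixed key ordering

def ask(user_query: str, current_time: str):
    return client.messages.create(
        model="claude-sonnet-4-6",            # placeholder model id
        max_tokens=1024,
        tools=STABLE_TOOLS,
        system=[
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                # Marks the end of the cacheable prefix. Everything up to and
                # including this block (tools plus system) must not change
                # between calls, or the cache hit rate drops to zero.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            # Per-request material goes after the cached prefix, never inside it.
            {"role": "user", "content": f"Current time: {current_time}\n\n{user_query}"},
        ],
    )
```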

For longer-lived prefixes, Anthropic offers a 1-hour cache TTL (time-to-live) at 2x the standard write cost. Whether to opt in depends on your traffic shape. A feature handling steady QPS (queries per second) amortises the 5-minute default fine. A feature with bursty, sparse traffic is losing cache hits in the gaps between requests and should pay for the longer TTL.

2. Retrieval quality is the second lever and it compounds

The default RAG (retrieval-augmented generation) recipe of chunk, embed, top-k cosine, prompt is a flat starting point. Anthropic's Contextual Retrieval work shows why.

Their benchmark: contextual embeddings plus contextual BM25 cut the top-20 retrieval failure rate by 49%, from 5.7% to 2.9%. Adding a reranker drops the failure rate to 1.9%, a 67% total reduction. The preprocessing step, with prompt caching, costs $1.02 per million document tokens, paid once at index time.
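The preprocessing step itself is short to sketch. This assumes a generic chunker, embedder, vector index and BM25 index, plus a cheap model call per chunk; the helper names are illustrative, not a specific library's API, and the prompt is along the lines Anthropic describes in the Contextual Retrieval post.

```python
CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short, succinct context that situates this chunk within the overall
document, to improve search retrieval of the chunk."""

def build_contextual_index(document: str):
    for chunk in split_into_chunks(document):    # hypothetical chunker
        # One cheap model call per chunk. With the full document held in a
        # cached prefix, this is the ~$1 per million document tokens step.
        context = cheap_model(CONTEXT_PROMPT.format(document=document, chunk=chunk))
        contextualised = f"{context}\n\n{chunk}"
        # Index the contextualised text in both the vector store and BM25.
        vector_index.add(embed(contextualised), payload=contextualised)   # hypothetical
        bm25_index.add(contextualised, payload=contextualised)            # hypothetical
```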

We wrote about five concrete mistakes in SME retrieval earlier this week. The unit-economics angle is sharper. A RAG system with better retrieval sends shorter, more relevant context to the model, which is both cheaper per request and more likely to produce an answer a human trusts without double-checking. Precision at k flows straight into the bill: a 2x improvement in retrieval quality roughly halves the context size you need to ship, which roughly halves the input cost, which stacks with whatever caching discount applies.

You cannot spend your way out of bad retrieval

The converse is also true. Teams whose retrieval is bad respond by expanding context. Token cost rises linearly. Accuracy does not. You can spend your way out of bad retrieval up to a point. Past that point the accuracy curve stops moving and the cost curve keeps climbing.
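Measuring this does not need a benchmark harness. A precision-at-k check over a small hand-labelled set is an afternoon's work; the sketch below assumes you can list, for each test question, the chunk ids a correct answer must draw on, and that `retrieve` is your production retrieval function. The same number answers the third diagnostic question further down.

```python
def precision_at_k(retrieve, test_set, k: int = 5) -> float:
    """Fraction of retrieved chunks that are relevant, averaged over queries.

    `test_set` is a list of (question, set_of_relevant_chunk_ids) pairs
    labelled by hand; 30 questions is enough to see the trend.
    """
    scores = []
    for question, relevant_ids in test_set:
        retrieved = retrieve(question, top_k=k)
        hits = sum(1 for chunk in retrieved if chunk.id in relevant_ids)
        scores.append(hits / k)
    return sum(scores) / len(scores)
```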

3. Tier routing is where model choice actually matters

Within a tier, the choice of frontier model is a minor optimisation. Across tiers it is a 5x cost difference and a large opportunity for most production workloads.

The public evidence is RouteLLM, which showed an 85% cost reduction on MT-Bench at 95% of GPT-4 quality by routing most queries to a cheaper model and escalating the hard ones. On MMLU the reduction was 45%; on GSM8K, 35%. Benchmark-specific numbers, but the pattern holds: for many production workloads a sizeable fraction of queries do not need the top-tier model.

What the routing work looks like in practice

In practice, the routing work looks like this. Classify queries into the categories your product actually has: summarise, extract structured fields, answer factual questions, write copy, chain tool calls. Run each category against Haiku and Sonnet on a real test set. Route everything that passes the Haiku quality bar to Haiku. Route everything that needs the step up to Sonnet. Send only the residual reasoning-heavy work to Opus. The same decomposition works on OpenAI's tier ladder and on Gemini's, and across providers if your stack tolerates it.
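A minimal router is not much code. The category names, tier assignments and model ids below are placeholders; the real work is the offline evaluation that decides which categories clear the cheaper tier's quality bar.

```python
# Categories that cleared the Haiku quality bar on your offline test set.
CHEAP_TIER = {"summarise", "extract_fields", "factual_qa"}
# Categories that need the step up but not the top tier.
MID_TIER = {"write_copy", "tool_chain"}

def pick_model(query: str) -> str:
    category = classify(query)        # hypothetical: cheap model call or trained classifier
    if category in CHEAP_TIER:
        return "claude-haiku-4-5"     # placeholder id, cheapest tier
    if category in MID_TIER:
        return "claude-sonnet-4-6"    # placeholder id, middle tier
    return "claude-opus-4-7"          # residual reasoning-heavy work only
```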

This is the lever that most cleanly justifies a week of engineering time. A routing layer that moves 60% of Sonnet traffic to Haiku cuts input cost on that feature by roughly 40%, compounding with caching. The routing layer itself is a few days of work. It pays back on day one.
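The arithmetic behind that 40%, using the input prices quoted earlier: leaving everything on Sonnet costs $3 per million input tokens, while moving 60% of the traffic to Haiku gives 0.4 × $3 + 0.6 × $1 = $1.80 per million, a 40% reduction before any cache-read discount is applied.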

4. Long context is a trap unless you respect the attention budget

The frontier models advertise 1 million token context windows. Anthropic's engineering team describes context as an "attention budget" that depletes with every token: more tokens in the window means less accurate recall. Databricks' long-context RAG benchmark found frontier models hold up past 64k tokens, but older and smaller models degrade well before that. The Lost in the Middle paper documented that relevant information in the middle of a long context is systematically under-attended to. The frontier has partially closed that gap, though it has not eliminated it.

The cost curve is linear. A million-token call costs 100x a 10,000-token call on the same model. Caching helps on stable prefixes. It does not help on the variable retrieval payload that is the reason anyone ships long context in the first place.

Anthropic's own guidance, unchanged through several context-window expansions, is that if your knowledge base is under 200,000 tokens (about 500 pages), skip retrieval and stuff it. Above that, retrieve. That rule is still correct in April 2026. Below the threshold, a RAG pipeline is engineering cost for no benefit. Above the threshold, a 10,000-token retrieval plus a well-tuned reranker beats a 500,000-token stuff on cost, latency and accuracy.
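The per-call arithmetic at Sonnet's $3 per million input tokens: stuffing a 500,000-token corpus costs about $1.50 in input on every request, against roughly $0.03 for a 10,000-token retrieval payload. Even if the corpus is a stable, fully cached prefix billed at the $0.30 cache-read rate, the stuff is still around five times the retrieval, and it carries the latency and attention-budget penalties with it.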

A diagnostic you can run this week

If you want a concrete test of whether you have any of this headroom, here is the smallest useful version.

Pull your API bill for the last 30 days. For the single highest-volume AI feature, answer four questions.

First, what percentage of your input tokens are billed at the cache-read rate? If the number is under 40%, you have caching headroom. If the number is zero, caching is not on.

Second, for a sample of 50 real production queries, would Haiku's answer be acceptable on at least half? If yes, you have tier-routing headroom in the 30 to 60 percent range on that feature.

Third, if the feature uses retrieval, what is your precision at 5 on a held-out test set of 30 questions? If you do not have a test set, that is its own answer: retrieval quality is not being measured and the cost of bad retrieval is already in your bill.

Fourth, what is the median context size on a request? If it is growing over time and answer quality is flat, you are buying accuracy with tokens and the curve is working against you.

None of these are architectural changes. They are diagnostic questions. The answers tell you which of the four levers is yours to pull first.
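For the first question, the numbers are already in the API responses. A minimal sketch, assuming you log the usage object Anthropic's Messages API returns on each call (field names as the API reports them at the time of writing; other providers expose equivalents):

```python
def cache_read_share(usage_log) -> float:
    """Share of input tokens billed at the cache-read rate over a period.

    `usage_log` is an iterable of usage objects, e.g.
    {"input_tokens": ..., "cache_creation_input_tokens": ...,
     "cache_read_input_tokens": ..., "output_tokens": ...}.
    """
    cache_read = sum(u.get("cache_read_input_tokens", 0) for u in usage_log)
    total_input = sum(
        u.get("input_tokens", 0)
        + u.get("cache_creation_input_tokens", 0)
        + u.get("cache_read_input_tokens", 0)
        for u in usage_log
    )
    return cache_read / total_input if total_input else 0.0
```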

Where Appify sits on this

Our RAG and AI consulting practices run this exact exercise with mid-market teams already shipping AI. Our experience: caching discipline, retrieval quality, tier routing and context posture together carry more of the P&L case than any frontier-model comparison a team can run. Model selection matters at the tier boundary. Inside a tier, architecture is where the bill moves.

If your AI feature is live and you think its unit economics are off, book a call. We can usually name which of the four levers is your biggest lift within an hour of reading your stack.

Tagged

ai-engineering · unit-economics · prompt-caching · cost-optimization · llm-architecture

Ready to talk?

If this post maps to a problem you're hitting, we'd like to hear about it. We turn AI experiments into production systems.

Start a conversation

Related articles


AI Engineering

RAG is not a search bar: five mistakes we keep seeing in SME retrieval

Most SME RAG systems break at the boring edges, not the vector database. Five mistakes we keep finding in other teams' retrieval, and the cheap fixes.