AI Engineering

OpenAI vs Anthropic API for production: what actually broke for us

Per-use-case comparison of the OpenAI and Anthropic APIs in production. Latency, cache economics, failure modes, and cost per task across chat, RAG, agents and batch.

For: Founders, CEOs and Heads of Operations whose AI bill has crossed GBP 5k a month and who want to know whether the team is on the right vendor

AI
Appify Intelligence Team
13 May 2026 | 9 minutes

When your AI feature is live and the monthly bill from OpenAI or Anthropic crosses a threshold that needs explaining at a finance review, the natural question is "are we on the right one?" The honest answer is that it depends on the use case, and it only becomes clear once a real workload has run through both APIs and the bill has come in.

This post is the per-use-case follow-up to what actually moves AI unit economics in 2026. That piece argued that architecture (caching discipline, retrieval quality, tier routing, context posture) moves the bill far more than model choice. It does. Once those levers are pulled, the residual vendor choice still matters, and it splits along four use cases: chat, RAG (retrieval-augmented generation), agent loops, and batch extraction. The reader for this one is a founder, CEO or Head of Operations at a UK, Irish, or EU company whose AI feature lives on one of these two APIs, whose monthly bill is north of GBP 5,000, and who needs to know whether the team is making the right call on the next 12 months of vendor lock-in. The post is the framework. Hand it to the engineering team if you want; the point is that the person paying the bill needs to recognise the levers.

Every pricing figure below carries a date, with the source linked inline. Pricing tables move every quarter. Treat this post as a snapshot of May 2026; the model names will have shifted by August.

The landscape in one paragraph

Both vendors charge similar list prices on the flagship tier and within a factor of two on the mid tier. As of 2026-05-13, Anthropic lists Claude Opus 4.7 at $5 input and $25 output per million tokens, Sonnet 4.6 at $3 / $15, and Haiku 4.5 at $1 / $5. OpenAI lists GPT-5.5 at $5 input and $30 output, GPT-5.4 at $2.50 / $15, and GPT-5.4-mini at $0.75 / $4.50. Within a tier the prices cluster. The bill diverges on cache mechanics, failure-mode handling, latency posture, and where the data sits.

Use case 1: synchronous chat with a short stable system prompt

The shape: a user-facing chat surface, a stable system prompt under 4,000 tokens, a moving conversation tail. Volume in the hundreds of thousands of calls a month per tenant.

Latency is the dominant constraint. The user is reading characters as they stream; time-to-first-token (TTFT) drives the felt response time more than total tokens generated.

April 2026 third-party probes from DigitalApplied's benchmark put Claude Sonnet 4.6 at 0.74s p50 TTFT and 104 output tokens per second, against GPT-5.5 standard at 1.12s p50 and 92 tokens per second. p95 inflates to 1.61s for Sonnet 4.6 and 2.41s for GPT-5.5. Treat these as directional, not vendor SLOs (service-level objectives); both shift week to week.

The cache picture is where the cost actually moves. Below the cache-write threshold, neither vendor lets you cache. Anthropic's threshold for Sonnet 4.6 is 2,048 tokens, per their prompt caching docs. For Opus 4.7 and Haiku 4.5 it is 4,096 tokens. OpenAI's threshold is 1,024 tokens, automatic, with the company claiming up to 90% off cached input and 80% lower time-to-first-token. Anthropic's cache reads are explicitly 10% of base input price (a 90% discount), with 5-minute cache writes at 1.25x base and 1-hour writes at 2x base.

For a chat product with a 3,000-token system prompt and a 5-minute average gap between consecutive user turns, both APIs cache. Anthropic's 5-minute TTL (time-to-live) default matches typical chat session cadence and the math works out cheaper per call once a session has more than one turn. For sparse traffic with multi-hour gaps, neither default helps; Anthropic's 1-hour write lets you pay 2x the input price once to skip full-price reads for the next hour. OpenAI does not currently expose a comparable explicit TTL knob.
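
For the engineering team, the knob on the Anthropic side looks roughly like the sketch below, assuming the anthropic Python SDK; the model id is illustrative and the 1-hour TTL option is noted in a comment rather than wired in.

```python
# Minimal sketch, assuming the anthropic Python SDK: mark the stable system
# prompt as cacheable so repeat turns pay the cache-read price on it. The
# model id mirrors the article's naming and is illustrative.
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are the in-app assistant for Example Ltd. ..."  # ~3,000 tokens, byte-stable

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        # Default ephemeral cache, roughly the 5-minute TTL discussed above.
        # The 1-hour write ({"type": "ephemeral", "ttl": "1h"}) costs 2x input
        # once and may require a beta header depending on your SDK version.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "How do I export my invoices?"}],
)

# The usage block shows whether the cache actually hit on this call.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```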

The tokenizer trap on Opus 4.7

Anthropic Opus 4.7 uses a new tokenizer that may consume up to 35% more tokens for the same text than earlier Claude versions. Cost models built on Opus 4.6 token counts silently undershoot the real bill by a third when migrated forward without re-measuring.
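
The cheap insurance is a re-measurement pass before the migration. A sketch, assuming the Anthropic token-counting endpoint; the model ids simply mirror the article's names and are illustrative.

```python
# Sketch: re-measure token counts on the candidate model before migrating a
# cost model forward, using the Anthropic token-counting endpoint.
import anthropic

client = anthropic.Anthropic()
SYSTEM_PROMPT = open("system_prompt.txt").read()   # the prompt your cost model is built on

def measure(model: str) -> int:
    result = client.messages.count_tokens(
        model=model,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": "placeholder"}],
    )
    return result.input_tokens

baseline = measure("claude-opus-4-6")    # the model the cost model was calibrated on
candidate = measure("claude-opus-4-7")   # the model you are migrating to
print(f"token inflation: {candidate / baseline - 1:.0%}")  # re-base the cost model on this
```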

When to pick which: for a chat surface that holds a session warm, Anthropic Sonnet 4.6 currently wins on per-message TTFT and on cache-hit economics. For chat with bursty, cold sessions where the system prompt is under 2,048 tokens, OpenAI's lower caching threshold is the deciding feature.

Use case 2: RAG, single-shot retrieval-augmented call

The shape: a query lands, retrieval pulls 6 to 12 chunks, the model answers grounded in those chunks. No multi-turn state. The system prompt and tool definitions are stable; the retrieval payload is variable.

Cache hit rate is the lever. Only the stable prefix caches. A naive RAG implementation that interpolates a timestamp or a session id into the system prompt drops the cache hit rate to zero on both vendors.
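
A sketch of the trap and the fix; the prompt shapes are invented for illustration, and the only point is that dynamic values belong after the cached prefix, not inside it.

```python
from datetime import datetime, timezone

# The trap: interpolating a timestamp (or session id) into the system prompt
# makes every call a unique prefix, so nothing ever hits the cache.
system_bad = (
    "You answer strictly from the provided context. "
    f"Today is {datetime.now(timezone.utc):%Y-%m-%d %H:%M}."
)

# The fix: keep the system prompt byte-stable and pass dynamic values after
# the cached prefix, in the user turn alongside the retrieval payload.
system_good = "You answer strictly from the provided context."
retrieved_chunks = "..."                      # output of your retriever
query = "How do I export my invoices?"
user_turn = (
    f"Current date: {datetime.now(timezone.utc):%Y-%m-%d}\n\n"
    f"Context:\n{retrieved_chunks}\n\n"
    f"Question: {query}"
)
```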

For a 4,000-token stable system prompt that does cache and a 6,000-token retrieval payload that does not, the per-call cost on Anthropic Sonnet 4.6 in 2026-05 is roughly: cache read of 4,000 tokens at $0.30/MTok = $0.0012, plus 6,000 fresh input tokens at $3/MTok = $0.018, plus 600 output tokens at $15/MTok = $0.009. Total: about $0.028 per call. On GPT-5.4 the same shape costs: cached input 4,000 tokens at $0.25/MTok = $0.001, plus 6,000 fresh at $2.50/MTok = $0.015, plus 600 output at $15/MTok = $0.009. Total: about $0.025 per call. Within 12%. At 1 million calls a month the gap is $3,000.
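
The same arithmetic as a small cost model the team can re-run when the price list moves; the prices and the 10% cache-read multiplier are the May 2026 figures quoted above, nothing more.

```python
def per_call_cost(cached_in: int, fresh_in: int, out: int,
                  price_in: float, price_out: float,
                  cache_read_mult: float = 0.10) -> float:
    """Cost per call in dollars. Prices are $/million tokens; the cache-read
    multiplier is 10% of base input, matching both vendors' published discount."""
    return (cached_in * price_in * cache_read_mult
            + fresh_in * price_in
            + out * price_out) / 1_000_000

# The RAG shape from the article: 4k cached prefix, 6k fresh retrieval, 600 out.
sonnet = per_call_cost(4_000, 6_000, 600, price_in=3.00, price_out=15.00)
gpt54  = per_call_cost(4_000, 6_000, 600, price_in=2.50, price_out=15.00)
print(f"${sonnet:.4f} vs ${gpt54:.4f} per call; "
      f"monthly gap at 1M calls: ${(sonnet - gpt54) * 1_000_000:,.0f}")
```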

The differentiator is what the model does when retrieval fails. Anthropic Sonnet 4.6 currently abstains more often when retrieval returns weak chunks. GPT-5.4 currently produces more confident answers that occasionally hallucinate. We have seen both patterns in production. The right test is to wire up your real eval set, as discussed in what production AI evals look like, and grade refusal vs hallucination on out-of-distribution queries. Neither failure mode is wrong in the abstract. They are wrong in different ways for different products.
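
A minimal shape for that test is sketched below; call_model and judge are hypothetical stand-ins for your own client wrapper and your own grader (a rubric-driven LLM judge or a human pass).

```python
from collections import Counter

# Minimal sketch of a failure-mode eval on out-of-distribution queries.
# call_model() wraps whichever API is under test, with retrieval deliberately
# returning weak chunks; judge() returns one of "refusal", "grounded",
# "hallucinated". Both are hypothetical stand-ins for your own code.

def failure_mode_mix(ood_queries: list[str], call_model, judge) -> Counter:
    counts = Counter()
    for query in ood_queries:
        answer = call_model(query)
        counts[judge(query, answer)] += 1
    return counts

# A product where a silent wrong answer is costly wants a high refusal share;
# a product where "I don't know" kills trust wants the opposite.
```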

EU data residency matters here for a UK, Irish, or EU mid-market team. OpenAI shipped European data residency on the API with zero data retention in February 2025, routing requests in-region for eligible projects. Anthropic's first-party API still defaults to US infrastructure; EU data residency on Claude today routes through AWS Bedrock or Google Vertex AI, per Anthropic's privacy centre. Zero Data Retention is available on Anthropic via DPA (data processing addendum), but it is not the same as in-region processing. If your DPO (data protection officer) requires data not to leave the EU at all, OpenAI's first-party endpoints clear that bar faster than Anthropic's first-party endpoints do.

When to pick which: for cost-sensitive RAG at high volume on stable prefixes, the two vendors are within 12% on list. For RAG inside an EU data-residency requirement on the first-party API, OpenAI is the simpler answer in May 2026. For RAG where conservative refusal beats confident hallucination, Anthropic is the safer default.

Use case 3: agent loop with tool calls

The shape: the model plans, calls tools, reads results, calls more tools, returns a final answer. Typically 4 to 15 model invocations per user task. Failure modes compound across turns.

Two features drive vendor choice here.

First, structured output guarantees. OpenAI's Structured Outputs in strict mode is documented to produce JSON that matches the supplied schema, with the caveat that it breaks if the response hits max_tokens or if parallel tool calls are enabled. In practice this means tool-call dispatch can rely on the shape of the response and skip a defensive parse-or-retry layer. Anthropic does not offer a comparable strict-mode guarantee on tool outputs as of May 2026; the model returns tool calls in the documented schema with high reliability, but no machine-checked guarantee. For an agent that calls 10 tools in sequence and where any malformed call kills the loop, OpenAI's strict mode removes a class of bugs.
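
As a sketch, a strict-mode call looks like the following; the schema is invented for illustration and the model id mirrors the article's naming rather than a confirmed identifier.

```python
import json
from openai import OpenAI

client = OpenAI()

# Strict mode constrains the response to this schema, so the dispatcher can
# json.loads() it without a defensive parse-or-retry layer. It can still fail
# if the response hits max_tokens, so keep that check in the loop.
schema = {
    "name": "invoice_fields",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "supplier": {"type": "string"},
            "total_gbp": {"type": "number"},
            "due_date": {"type": "string"},
        },
        "required": ["supplier", "total_gbp", "due_date"],
        "additionalProperties": False,
    },
}

completion = client.chat.completions.create(
    model="gpt-5.4",                       # illustrative model id from the article
    messages=[{"role": "user", "content": "Extract the fields from: ..."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
fields = json.loads(completion.choices[0].message.content)
```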

Second, prompt caching across iterations. Agent loops re-send the full conversation history (system prompt, tool definitions, tool call results so far) on every turn. The prefix grows but stays largely stable. Anthropic's 5-minute cache default lands perfectly on typical agent loop cadences. With caching enabled and a 6,000-token stable prefix plus 2,000 tokens of new turn content, the third and later turns on Sonnet 4.6 pay cache-read price ($0.30/MTok) on 6,000 tokens and full input price on 2,000. That is the single biggest reason agent-loop costs on Anthropic with cache discipline tend to come in under naive OpenAI agent costs without caching discipline.
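
A sketch of that loop with the cache breakpoint on the stable prefix, assuming the anthropic Python SDK; the tool schema and the run_tools executor are illustrative stand-ins.

```python
# Agent-loop sketch on the Anthropic Messages API: the stable prefix (tool
# definitions + system prompt) carries one cache breakpoint, so from the
# second turn onward only the new history is paid at full input price.
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "lookup_order",
    "description": "Fetch an order by id from the order system.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]  # must stay byte-stable across turns, or every turn pays the uncached price

SYSTEM = [{
    "type": "text",
    "text": "You are the support agent for Example Ltd. Use the tools provided.",
    # Breakpoint here caches the whole prefix above it: tool definitions + system.
    "cache_control": {"type": "ephemeral"},
}]

def run_tools(content_blocks):
    # Hypothetical executor: map each tool_use block to a tool_result block.
    return [{"type": "tool_result", "tool_use_id": b.id, "content": "status: awaiting stock"}
            for b in content_blocks if b.type == "tool_use"]

messages = [{"role": "user", "content": "Why has order 4411 not shipped?"}]
while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",        # illustrative model id
        max_tokens=1024,
        system=SYSTEM,
        tools=TOOLS,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": run_tools(response.content)})
```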

Failure mode to watch: tool definitions that change across iterations bust the cache. Anthropic's tool-use system prompt adds 346 tokens for auto tool choice on Claude 4.x models, per the pricing docs. Constant tool-definition churn means paying the full, uncached input price on the prefix every turn.

When to pick which: for an agent loop where strict-schema tool calls are load-bearing and breaking the format kills the run, OpenAI with Structured Outputs. For an agent loop dominated by long prefixes, repeated turns, and bursty traffic where 1-hour cache TTL would save real money, Anthropic.

Use case 4: batch extraction

The shape: process 100,000 PDFs, emails, contracts, or support tickets overnight. Throughput beats latency. Errors must be machine-detectable so the post-processor can retry or escalate.

The Batch APIs converge: Anthropic offers 50% off both input and output on asynchronous batch processing; OpenAI offers the same 50% off via its Batch API.

The decider is again Structured Outputs. A batch extraction job that returns 100,000 JSON objects against a fixed schema is exactly the workload OpenAI's strict mode was designed for. Every output that does not match the schema becomes a machine-detectable error instead of a downstream parsing nightmare. On Anthropic the same workload requires either a strict downstream validator (which you should have anyway) or a constrained prompt that asks for the same schema by convention; the reliability is good in practice without being contractual.
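
A sketch of the batch input file for that workload: the request envelope follows OpenAI's Batch API JSONL format, while the schema, the instruction prompt and the load_documents iterator are stand-ins for your own.

```python
import json

EXTRACTION_PROMPT = "Extract the contract fields below as JSON."  # stands in for the 5k-token prefix

def load_documents():
    # Hypothetical: replace with an iterator over your 100,000 documents.
    yield "doc-0001", "This agreement is between ..."

# One JSONL line per document. Each body is a normal chat completion with
# strict Structured Outputs, so any schema miss surfaces as a machine-
# detectable error in the output file instead of a downstream parsing failure.
schema = {
    "name": "contract_fields",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "counterparty": {"type": "string"},
            "renewal_date": {"type": "string"},
            "auto_renews": {"type": "boolean"},
        },
        "required": ["counterparty", "renewal_date", "auto_renews"],
        "additionalProperties": False,
    },
}

with open("batch_input.jsonl", "w") as f:
    for doc_id, text in load_documents():
        request = {
            "custom_id": doc_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5.4",        # illustrative model id from the article
                "messages": [
                    {"role": "system", "content": EXTRACTION_PROMPT},
                    {"role": "user", "content": text},
                ],
                "response_format": {"type": "json_schema", "json_schema": schema},
            },
        }
        f.write(json.dumps(request) + "\n")
# Upload the file with purpose="batch", then create the batch against /v1/chat/completions.
```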

Batch with caching on Anthropic shows what happens when both discounts stack. A 100,000-document extraction on Sonnet 4.6 with a 5,000-token instruction prefix that caches and 2,000 tokens of document body per call: cache read 5,000 tokens at 50% batch discount on $0.30/MTok = $0.00075, plus 2,000 input tokens at 50% batch on $3/MTok = $0.003, plus 500 output tokens at 50% batch on $15/MTok = $0.00375. Total: about $0.0075 per document, or $750 for 100,000 documents. On GPT-5.4 the same shape with strict mode, Batch, and prompt caching applied to the 5,000-token prefix lands at roughly $0.0095 per document, or $950 for 100,000 documents. The two are within 25% on this workload; the choice is more about which schema-enforcement story matches your downstream consumer than about price.

When to pick which: if the workload is schema-driven extraction with a hard schema, OpenAI Batch with strict mode is the right default. If the workload runs over a stable instruction prefix and the schema is enforced downstream, Anthropic Batch with prompt caching is cheaper.

A word on EU data residency

The constraint here is a hard one for a UK, Irish, or EU operator. GDPR (General Data Protection Regulation) and a growing set of sectoral rules want personal data to stay in the EU; some regulated industries want it not to touch a US-headquartered processor at all. As of 2026-05-13:

  • OpenAI: first-party API with European data residency and Zero Data Retention is available on eligible Enterprise projects across Europe, the UK, and several other regions. The data sits at rest in-region. This is the cleanest setup if your DPO requires first-party EU processing.
  • Anthropic: first-party Claude API defaults to US. EU residency on Claude routes through AWS Bedrock or Google Vertex AI today. ZDR is available via DPA but is not the same as in-region processing. Anthropic has listed Microsoft Foundry EU support as "Coming 2026" on their regional compliance page.

If your enterprise legal team has flagged in-region processing as a hard requirement (some Irish public-sector and German manufacturing customers do), Anthropic's first-party API is currently the wrong default; route through Bedrock or Vertex, or pick OpenAI.

When neither is the answer

A non-trivial fraction of teams should not be standardising on either vendor's first-party API. The shape:

  • Heavy AWS workloads with existing IAM (identity and access management), VPC (virtual private cloud), and KMS (key management service) discipline: Bedrock gives you Claude and several open models with the same audit and access controls as the rest of your stack.
  • Heavy GCP workloads: Vertex AI is the equivalent and now offers Claude with full EU residency.
  • Workloads that need offline inference, predictable per-second cost, or models with custom fine-tunes the frontier labs do not offer: open weights on a managed inference platform (Together AI, Fireworks, AWS Bedrock open-model endpoints) win on cost at scale and on data control.

A team committing fully to OpenAI or Anthropic without checking whether one of these three exits applies is leaving 20% to 60% of potential savings on the table at scale.

What we tell mid-market teams

Three things, in order.

First, pick by use case rather than by vendor brand. If your product has more than one of the four shapes above, you will run both APIs in production. The teams that ship reliable LLM features pick the right tool per workload and route accordingly.
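
In practice the routing can be as dull as a config table. The picks below are one reading of the trade-offs above, not a verdict, and the model ids are illustrative.

```python
# One reading of the trade-offs above as a routing table. Model ids are
# illustrative; revisit the table when list prices or residency options move.
ROUTES = {
    # warm sessions, stable system prompt: cache-hit economics favour Sonnet
    "chat":  {"vendor": "anthropic", "model": "claude-sonnet-4-6", "cache_ttl": "5m"},
    # first-party EU data residency is the hard requirement on this workload
    "rag":   {"vendor": "openai", "model": "gpt-5.4", "eu_residency": True},
    # strict-schema tool calls are load-bearing for the loop
    "agent": {"vendor": "openai", "model": "gpt-5.4", "strict_outputs": True},
    # stable instruction prefix, schema enforced downstream: batch + caching stack
    "batch": {"vendor": "anthropic", "model": "claude-sonnet-4-6", "batch_discount": 0.5},
}

def route(workload: str) -> dict:
    """Return the vendor/model configuration for a workload shape."""
    return ROUTES[workload]
```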

Second, build the cache discipline before the vendor choice matters. Caching is the lever that moves the bill, on both vendors. A team without cache hit-rate telemetry is paying full price on tokens it has already paid the model to read.
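
The telemetry is cheap because both APIs already return the numbers. A sketch follows; the usage field names match the vendors' current response objects but are worth verifying against your SDK version.

```python
# Cache hit-rate telemetry from the usage block each API already returns.

def anthropic_hit_rate(usage) -> float:
    cached = usage.cache_read_input_tokens or 0
    total = cached + (usage.cache_creation_input_tokens or 0) + usage.input_tokens
    return cached / total if total else 0.0

def openai_hit_rate(usage) -> float:
    cached = usage.prompt_tokens_details.cached_tokens or 0
    return cached / usage.prompt_tokens if usage.prompt_tokens else 0.0

# Log one of these per call and aggregate per workload: a hit rate near zero
# on a stable-prefix workload means full price is being paid on tokens the
# model has already read.
```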

Third, treat EU data residency as a contract requirement rather than a preference. Get your DPO into the conversation before you sign an API agreement. Switching after the fact is expensive.

Our AI consulting practice runs this exact diagnostic with mid-market teams that already ship production AI: per-workload vendor fit, cache hit rate, failure-mode mix, data-residency posture. The output is a one-page bill-by-bill cut showing which workload should sit on which API, with the GBP figure attached.

If you are spending more than GBP 5,000 a month on OpenAI or Anthropic and have not done that exercise in the last quarter, book a call. We can usually name the biggest miss within an hour of reading your stack.

This post will need a refresh in three months. Model names will shift, list prices will move, EU residency will catch up. The framework will not: pick the API by use case, build caching discipline before the vendor choice matters, get the data-residency call right early.

Tagged

ai-engineering, openai, anthropic, llm-vendor-choice, prompt-caching, eu-data-residency

Ready to talk?

If this post maps to a problem you're hitting, we'd like to hear about it. We turn AI experiments into production systems.

Start a conversation

Related articles


AI Engineering

How to add AI to existing software without rebuilding

Adding AI to a legacy stack is an integration architecture problem more than a model selection problem. The four parts of an AI overlay that reach production.


AI Engineering

Agents in production: the control surface is the product

Most production AI agents fail on the layer around the model. Tool design, the loop, the step budget, the context shape: where the engineering goes.


AI Engineering

RAG is not a search bar: five mistakes we keep seeing in SME retrieval

Most SME RAG systems break at the boring edges, not the vector database. Five mistakes we keep finding in other teams' retrieval, and the cheap fixes.