How to add AI to existing software without rebuilding
Adding AI to a legacy stack is an integration architecture problem more than a model selection problem. These are the four parts of an AI overlay that decide whether it reaches production.
For: Founders, CEOs and Heads of Operations whose engineering team is about to add AI to a working business system

BCG's 2025 Build for the Future report puts a number on the AI failure rate at the population level: of 1,250 surveyed firms, 60% generate no material value from AI despite continued investment, and only 5% capture substantial value at scale. McKinsey's State of AI 2025 finds that the AI high performers' edge over the rest is "overwhelmingly organizational" - the gap is mostly in how the business is rewired around the model rather than in the model itself. The work the high performers do that the others skip is integration work.
Adding AI to a 5 to 20 year old line-of-business system - an ERP (enterprise resource planning), a SCADA (supervisory control and data acquisition) stack, a claims engine, a scheduling platform - is an integration architecture problem before it is a model selection problem. The model is a small piece of the build. The architecture decides whether the feature ships.
This post is for founders, CEOs and Heads of Operations whose engineering team, internal or agency, has been asked to add AI to a system the business depends on. The model decision is the easy part to delegate. The four parts that follow are the ones the business has to keep honest, because nobody else is going to.
The model is the cheap part
Within Anthropic's own line, moving from Opus 4.7 to Haiku 4.5 changes the per-token bill by 5x. Across providers at the same tier, the bill moves by about 30%. Inside a tier the model is a commodity to anyone with a cloud account.
Berkeley's BAIR group makes the same case from the research side, arguing that "state-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models". Databricks data cited in the same post: 60% of LLM applications use retrieval-augmented generation, 30% use multi-step chains. The model is one node in a graph of code, retrieval, eval, and human checkpoints. Replacing the model is a config change. Replacing the graph is a rewrite.
One honest counter-case. Some problems are model-bound: novel mathematical reasoning, frontier R&D, specialist coding work. If your AI feature is a research-grade reasoning engine, model choice matters more than the post below suggests. For a typical mid-market integration where the AI summarises case notes, drafts outbound emails, classifies tickets, ranks leads or extracts fields from PDFs, the architecture is the work.
The four parts of an AI overlay
An overlay is the term we use for AI capability bolted onto a legacy system without touching its core. Two older architecture patterns describe the shape, both documented in the Azure architecture catalogue.
The Strangler Fig pattern is the high-level shape. "A facade (proxy) intercepts requests that go to the back-end legacy system. The facade routes these requests either to the legacy application or to the new services." You add the AI overlay as a new service behind that facade, route a small slice of traffic to it, expand the slice as the AI proves itself, and never touch the legacy core.
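In code, the facade is not much more than a routing decision. A minimal sketch in Python, assuming hypothetical handler functions for the legacy system and the overlay (in a real build these would be HTTP clients or message-bus calls):

```python
import random

# Hypothetical handlers standing in for calls into the legacy system
# and the new AI overlay service.
def handle_with_legacy(request: dict) -> dict:
    return {"source": "legacy", "result": "..."}

def handle_with_ai_overlay(request: dict) -> dict:
    return {"source": "overlay", "result": "..."}

# Fraction of traffic routed to the overlay; start small and widen it
# only as the overlay proves itself against the legacy control group.
OVERLAY_TRAFFIC_SHARE = 0.05

def facade(request: dict) -> dict:
    """Strangler Fig facade: intercepts every request and decides whether
    the legacy system or the new overlay serves it. The legacy core is
    never modified; only the routing share changes."""
    if random.random() < OVERLAY_TRAFFIC_SHARE:
        return handle_with_ai_overlay(request)
    return handle_with_legacy(request)
```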
The Anti-Corruption Layer pattern, originally from Eric Evans' Domain-Driven Design, is the protocol detail underneath. The layer translates between the legacy data model (SOAP, IDocs, fixed-width flat files, EDIFACT) and the AI overlay's JSON. Microsoft's wording: it lets "one system remain unchanged while the other can avoid compromising its design and technological approach". Translation lives in one place. Both systems keep their own shape.
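A minimal sketch of that translation layer, assuming a made-up fixed-width layout for a legacy invoice record (real column positions come from the legacy system's own spec):

```python
import json

# Hypothetical fixed-width layout: columns 0-9 invoice number,
# 10-19 customer code, 20-29 amount in pence, 30-37 posting date YYYYMMDD.
def translate_invoice_record(raw: str) -> dict:
    """Anti-Corruption Layer: the only place that knows both shapes.
    The legacy system keeps its flat-file format; the overlay only ever
    sees clean JSON."""
    return {
        "invoice_number": raw[0:10].strip(),
        "customer_code": raw[10:20].strip(),
        "amount_gbp": int(raw[20:30]) / 100,
        "posted_on": f"{raw[30:34]}-{raw[34:36]}-{raw[36:38]}",
    }

record = "INV0012345CUST000042000001999920260114"
print(json.dumps(translate_invoice_record(record), indent=2))
```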
Build those two patterns into the overlay and four practical parts fall out.
1. API surface
The hardest decision is which seam in the legacy system the overlay attaches to. Most teams get this wrong on the first pass. They wire the AI into the UI layer because that is where humans see the output, and end up shipping a feature that cannot read or write any data that matters. The right seam is usually deeper: a service bus message, a stored procedure, a queue event, an outbound webhook. Wherever the legacy system already produces a clean record of work happening.
For an ERP the seam is often the outbound document event (purchase order created, invoice posted). For SCADA the seam is the historian or the alarm stream rather than the HMI (human-machine interface). For a claims system the seam is the FNOL (first notice of loss) record or the daily case-summary feed.
Pick one seam. Define a single, narrow, versioned contract for the overlay to consume from it. An overlay with a tight contract is a feature you can roll back. An overlay wired across six surfaces is a rewrite waiting to happen.
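What a narrow, versioned contract looks like in practice, sketched in Python with hypothetical field names for an invoice-posted event; the point is that the version lives in the type, so a breaking change means a new type, never a silent reshape:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class InvoicePostedV1:
    invoice_number: str
    customer_code: str
    amount_gbp: float
    posted_at: datetime
    source_system: str  # which legacy instance emitted the event

def validate(event: dict) -> InvoicePostedV1:
    """Reject anything that does not match the contract before it reaches
    a prompt; the overlay never guesses at missing fields."""
    return InvoicePostedV1(
        invoice_number=event["invoice_number"],
        customer_code=event["customer_code"],
        amount_gbp=float(event["amount_gbp"]),
        posted_at=datetime.fromisoformat(event["posted_at"]),
        source_system=event["source_system"],
    )
```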
2. Rollback plan
Every senior engineer knows the rollback question for a code deploy. Almost no one asks it of an AI feature, which is why so many AI features ship as one-way doors. The discipline is the same as a normal feature flag and canary rollout: ship the AI behind a flag, route a small cohort to it, compare outcomes against a control, expand when the numbers hold, kill the flag when they do not. Feature flag tooling is mature and cheap; pick one and wire it in before the first model call.
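One concrete shape for the cohort split, as a minimal Python sketch; the rollout percentage and account identifier are placeholders, and in practice this logic usually lives inside the flag tool rather than hand-rolled code:

```python
import hashlib

ROLLOUT_PERCENT = 5  # widen as the numbers hold; set to 0 to kill the feature

def in_ai_cohort(account_id: str) -> bool:
    """Deterministic canary bucketing: the same account always lands in the
    same cohort, so outcome comparisons against the control stay clean."""
    bucket = int(hashlib.sha256(account_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT
```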
The discipline gap is treating the AI output as a code path you can turn off without redeploying the host application. A practical version: the overlay reads from the legacy seam, generates a candidate answer, writes it to a side table; the host application reads the side table only when the flag is on. Turning the flag off restores the pre-AI behaviour with zero application code changes.
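The host-side read path, sketched with a hypothetical flag client, side table and legacy interface; nothing here is a specific library, only the shape of the seam:

```python
def case_summary(case_id: str, flags, side_table, legacy) -> str:
    """Host application read path. Turning the flag off restores pre-AI
    behaviour with no redeploy of the host application."""
    if flags.is_enabled("ai_case_summary"):
        candidate = side_table.get(case_id)  # written by the overlay, never by the host
        if candidate is not None:
            return candidate
    # Flag off, or the overlay has not produced an answer yet: fall through
    # to whatever the system did before the AI existed.
    return legacy.case_summary(case_id)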
The overlay logs every input, output, and decision. The host application never blocks on an AI call. If the overlay times out or errors, the legacy path serves the request and the failure goes into the log. An AI overlay that can take down your existing software is not an overlay; it is a dependency you accidentally signed up for.
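The overlay-side write path, again as a hedged sketch: generate and side_table are placeholders for whatever client and storage the build already has, and the timeout value is illustrative.

```python
import json
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("ai_overlay")
_pool = ThreadPoolExecutor(max_workers=4)

def run_overlay(event: dict, generate, side_table, timeout_s: float = 10.0) -> None:
    """Overlay write path: the host never waits on this call. Timeouts and
    errors are logged and swallowed so the legacy path keeps serving."""
    log.info("overlay_input %s", json.dumps(event))
    try:
        answer = _pool.submit(generate, event).result(timeout=timeout_s)
        side_table.put(event["id"], answer)  # host reads this only when the flag is on
        log.info("overlay_output %s", json.dumps({"id": event["id"], "answer": answer}))
    except Exception:
        # No re-raise: an overlay failure must never take down the host path.
        log.exception("overlay_failure id=%s", event.get("id"))
```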
3. Eval harness
This is the part teams skip most often, and the part that decides whether the AI is allowed to ship at all. A production AI feature without an eval harness is a feature whose failure mode stays invisible until a customer finds it. We have written about what production AI evals look like in more detail; for an integration project the short version is three layers.
A frozen golden set with known-correct outputs, run on every prompt or code change, target near-100% pass rate. An LLM-as-judge rubric scoring open-ended outputs against a written standard, with a small weekly human-graded sample to keep the judge honest. Production telemetry feeding the eval set, so the failures users actually see become the next regression test.
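A minimal sketch of the first layer, assuming the golden set lives in a JSONL file of input/expected pairs and that exact-match scoring is appropriate (it is for classification and field extraction; open-ended outputs belong to the judge layer):

```python
import json

def run_golden_set(generate, path: str = "golden_set.jsonl") -> float:
    """Frozen golden set replayed on every prompt or code change.
    Each line holds an input and the known-correct output; the pass rate
    gates the release."""
    cases = [json.loads(line) for line in open(path)]
    passed = sum(
        1 for c in cases
        if generate(c["input"]).strip() == c["expected"].strip()
    )
    return passed / len(cases)

# Hypothetical usage: block the deploy if the pass rate slips.
# assert run_golden_set(my_overlay_generate) >= 0.98
```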
The 2026 vendor list is short. LangSmith, Braintrust and Promptfoo are the three we see most often in mid-market builds. Pick one before you write the second prompt. Anthropic's Building Effective Agents puts the discipline plainly: "the key to success, as with any LLM features, is measuring performance and iterating on implementations".
The harness is also the part of the architecture that lets you swap the model. Without it the model is interchangeable in theory only. Generic benchmark scores do not tell you whether Haiku is good enough on your data; your eval suite does. The teams that move freely between models have a harness. The teams stuck on whatever they shipped first do not.
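Reusing the run_golden_set sketch above, a model bake-off is a few lines; the model IDs and the make_generate factory are placeholders for whatever client wrapper the build already has:

```python
CANDIDATES = ["current-production-model", "cheaper-candidate-model"]

def bake_off(make_generate) -> dict:
    """Run the same frozen golden set against each candidate model and
    return pass rates, so the swap decision is made on your data rather
    than a generic benchmark."""
    return {model_id: run_golden_set(make_generate(model_id)) for model_id in CANDIDATES}
```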
4. Named owner on the business side
The piece a CTO cannot solve from the engineering side alone. Every AI overlay needs a named accountable owner from the business unit it touches. That owner sets the success metric (faster case resolution, lower handle time, higher first-call match rate), owns the kill criterion, and signs off on the rubric the LLM-as-judge uses.
McKinsey's high performers, per State of AI 2025, "are more likely than others to say their organizations have defined processes to determine how and when model outputs need human validation". The teams that ship AI have agreed on what good output looks like before the AI is built. The teams that stall are still arguing about it after the model has shipped twice.
For the integration in front of you, the owner is whichever business leader currently owns the legacy workflow the AI is augmenting. Head of Claims for the claims AI. Head of Operations for the SCADA overlay. Head of Sales Ops for the lead-routing AI. The CTO owns the engineering. The business owns the metric.
The cost shape of the build
For a mid-market organisation adding an AI feature to a 5 to 20 year old core system, a typical engagement runs in the GBP 80 to 200k range across three phases: four to six weeks of discovery to pick the seam, six to twelve weeks of integration build to ship the overlay behind a flag, and a defined post-launch period to tune the prompts and the eval harness. The model API spend inside that is usually under 5% of the total. The architecture, the eval harness, the data plumbing and the change management are the rest, which our piece on what actually moves AI unit economics tracks in detail.
What blows the budget is almost always a missed seam or a missed owner. A team that wires the AI into the UI and then discovers the data they need lives in the service bus burns six weeks rebuilding the integration. A team without a named business owner discovers in week ten that the metric was never agreed and the rollout has no kill criterion. Both are avoidable by spending a week in discovery on the questions above before the first line of code.
What to do this week
If your board has asked for an AI feature on a legacy stack, the diagnostic is short.
Pick the seam. Name the single legacy event or record the overlay will read from, and write down the contract. If the seam is the UI, look deeper.
Name the business owner. The leader who will sign off on success metrics, accept escalations, and make the kill call. Without that name, do not start.
Sketch the rollback. How do you turn the AI off in production without redeploying the host application? If the answer involves a code change and a release, the architecture is wrong.
Pick the eval vendor. One of the three named above. Set up an account before writing the second prompt.
Save the model decision for last. Start with the cheapest tier of the family you already trust and let the eval harness tell you when to escalate.
We run this exercise as an AI Operations Audit with founders and operators whose business has a legacy system in production and a board ask for an AI feature. The first conversation is free, and we will tell you honestly whether the work is one quarter or three. As our CEO Karl Mulligan put it on stage in April 2026: productivity is easier than profit. The architecture decides which one you get.