Honest AI automation ROI: three categories that hold past month six

"Can AI actually cut our admin cost, or are we about to spend a year proving it can't?" That is the operator question behind almost every AI automation procurement conversation in 2026. It is also the question the vendor decks answer with month-1 throughput numbers and the board answers, six months later, with a P&L that did not move.

This post is for a specific reader. A COO, CFO, or Head of Operations at a mid-market firm (50-500 staff, somewhere between GBP 10 million and GBP 200 million turnover) who has a board-approved AI automation programme on the desk and needs to defend year-1 numbers, not the demo numbers. The published 2025-2026 evidence base is now thick enough to answer the question honestly. Three categories of work hold their savings past month six. The rest mostly don't. Pricing the procurement against month-1 throughput is the single most common reason a board meeting six months in goes sideways.

What the 2025-2026 evidence actually says

Four data points anchor the conversation.

First, MIT's NANDA initiative published "The GenAI Divide: State of AI in Business 2025" in August 2025, based on 150 leader interviews, 350 employee surveys, and analysis of 300 public AI deployments. The headline number, which everyone has heard by now, is that 5% of enterprise AI pilots achieved rapid revenue acceleration; the rest stalled with little to no measurable P&L impact. The underneath finding, which fewer operators have internalised, is that back-office automation produced the biggest ROI in the sample, despite receiving less than half the budget. Generic productivity tools dominated spend. Specialised back-office automation dominated returns.

Second, BCG's AI Radar 2026, published 15 January 2026, reports that 60% of companies see minimal or no value (cost reduction or revenue gain) from AI despite significant effort, while 94% plan to continue investing. The AI Radar surveyed CEOs across 1,800 organisations. The gap between investment and value is the central finding. 90% of CEOs believe agentic AI is the bridge that closes it; trailblazer CEOs are allocating roughly 60% of their AI budgets to agentic, against 25% for everyone else.

Third, McKinsey's State of AI 2025, published November 2025, found that 62% of organisations are at least experimenting with AI agents, but only 39% report enterprise-level EBIT impact. The cost-and-revenue benefits show up at use-case level. The EBIT impact does not. The translation problem from use-case to P&L is now the named bottleneck.

Fourth, Klarna walked back its AI customer-service replacement in May 2025 after a 22% headcount reduction (down to 3,500 staff) and a 22% customer-satisfaction drop. CEO Sebastian Siemiatkowski told Bloomberg that "investing in the quality of human support is the way of the future for us." Klarna is the public canary because the company narrated the AI replacement loudly in 2024 and the reversal loudly in 2025; the same shape of walk-back is happening quietly in mid-market deployments where no one is doing the press release.

These four data points do not say AI automation is broken. They say something more specific. Pilots that demonstrate throughput in month one routinely fail to translate into board-visible P&L by month twelve. The shape of the gap is now well enough understood to plan around.

Why the savings cliff is structural, not a quality problem

The temptation, reading those numbers, is to blame the technology: a better-tuned model, a more capable agent, or a stronger reasoning step will fix it. The 2026 evidence cuts against that reading.

Addy Osmani's January 2026 essay on the 80% problem names the structural issue. AI agents reach 80% of working behaviour quickly. The remaining 20% is not minor cleanup. It is rate limiting, retry logic with backoff, observability, circuit breakers, audit logging, PII (personally identifiable information) handling, and the exception triage that follows from any of the above going wrong. Stanford's HAI 2026 AI Index reports AI agents on OSWorld at 66.3% accuracy, against operator targets in the 80-90% range. A 37% gap between lab benchmarks and production deployment is now the documented norm, not the exception.

The exception-handling overhead is the part that breaks the savings curve. A bot that automates 1,000 invoice approvals a week, with 90% straight-through processing, generates 100 exceptions a week that need a human reviewer. That reviewer is paid, trained, and sat at a desk; they need a queue, a triage rule, and an escalation path. Then the supplier portal changes its UI and the bot's straight-through rate falls from 90% to 70% overnight. Now there are 300 exceptions a week. The maintenance team that fixes the bot is paid too. Industry data on RPA maintenance (still the canonical comparable) puts annual maintenance at 20-30% of initial development cost; Forrester's 2020 figure of 45% of firms reporting weekly bot breakage is still cited because no one has published a more recent number that disagrees. AI agents inherit the same maintenance pattern with an added retraining cost when underlying models update.
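The arithmetic in that example is simple enough to keep in a notebook next to the vendor quote. The sketch below is illustrative only: the function name `weekly_exceptions` is an assumption made here, and the inputs are the figures from the paragraph above (1,000 approvals a week, straight-through rates of 90% and 70%).

```python
def weekly_exceptions(volume: int, straight_through_rate: float) -> int:
    """Items per week that fall out of the automated path and need a human reviewer."""
    if not 0.0 <= straight_through_rate <= 1.0:
        raise ValueError("straight_through_rate must be between 0 and 1")
    # Everything that is not processed straight through lands in the exception queue.
    return round(volume * (1.0 - straight_through_rate))


# Figures from the invoice-approval example in the text.
before_ui_change = weekly_exceptions(1000, 0.90)
after_ui_change = weekly_exceptions(1000, 0.70)

print(before_ui_change, after_ui_change)
print(after_ui_change / before_ui_change)  # how many times larger the queue became
```

A 20-point fall in the straight-through rate triples the reviewer workload, which is why the exception queue, not the headline rate, is the number to staff against.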

The governance overhead compounds this. Under the EU AI Act, in force from August 2024 with phased obligations through August 2026, any AI system used to make or substantially inform employment, credit, or essential-service decisions falls under high-risk obligations: documentation, post-market monitoring, human oversight. The compliance cost is real and it is fixed; it does not scale down with the size of the automation. A mid-market firm absorbing a high-risk AI governance file pays roughly the same governance overhead as a FTSE 100 firm, against a smaller savings base.

Add the three together (exception triage, maintenance, governance) and the month-1 throughput number is structurally an overstatement of the year-1 run-rate. Not because the AI failed. Because the cost components that show up in months three through twelve are not in the demo.

Three categories that hold past month six

The 2025-2026 evidence does identify categories of work where the savings hold. They share three features: the volume is high, the input shape is stable, and the failure cost of a wrong answer is bounded.

1. High-volume identical-shape transactions. Invoice-to-pay against a stable supplier base, payroll runs, recurring purchase-order issuance, standard claims adjudication, KYC (know your customer) re-screening on existing accounts. The 90%+ straight-through rate that vendors quote is real on this category because the input distribution is narrow. The exception queue is small enough to staff at one or two people. The maintenance is mostly OCR (optical character recognition) drift and rule updates, not novel-pattern handling. The MIT NANDA "back-office automation" category that produced the biggest ROI is mostly this. BCG cites Foxconn's 200+ factory deployment at $400 million in identified savings as a published example; the mid-market equivalent is invoice automation at GBP 200,000-400,000 of recurring annual labour.

2. Deterministic validations on stable data formats. Document classification against a fixed taxonomy, regulatory-form field extraction where the regulator has not changed the form in three years, contract-clause spotting against a closed clause library, alert triage against a fixed rule set. The work has a defined right answer and a defined wrong answer. The AI either matches the regex-replaceable rule or it doesn't, and you can measure that without a human reviewer in the loop most of the time. This is the category where the contextual retrieval engineering pattern actually does what the vendor demo says it does.

3. After-hours coverage where the alternative is overtime or a lost customer. Tier-one support after 6pm, weekend account-status questions, time-zone coverage for an international customer base, on-call alert triage between 11pm and 7am. The AI does not have to outperform a human; it has to outperform an empty queue. The unit economics work because the comparator is paid overtime, an outsourced contact centre at premium hours, or revenue lost when the customer abandons. Klarna's reversal happened on tier-one in primary business hours; the after-hours math is different.

These three categories share the structural property that makes the savings hold past month six. The exception queue is bounded by the input distribution, not by the AI's capability ceiling. As the AI improves, the exception queue shrinks. In the categories that don't hold, the exception queue is bounded by the world's distribution of weird cases, which is unbounded.
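The three shared features (high volume, stable input shape, bounded failure cost) can be turned into a rough screening check to run before a procurement conversation. This is a minimal sketch, not a scoring model: the class `WorkCategory` and the function `holds_past_month_six` are hypothetical names, and the yes/no answers for each process are judgement calls the operator has to make.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkCategory:
    """One candidate process, described by the three features named in the text."""
    name: str
    high_volume: bool
    stable_input_shape: bool
    bounded_failure_cost: bool


def holds_past_month_six(work: WorkCategory) -> bool:
    # All three features must be present; missing any one puts the process
    # in the group where the exception queue is not bounded by the inputs.
    return work.high_volume and work.stable_input_shape and work.bounded_failure_cost


# Illustrative classifications only; the booleans are assumptions for this sketch.
candidates = [
    WorkCategory("invoice-to-pay, stable supplier base", True, True, True),
    WorkCategory("legal review across novel templates", True, False, False),
]

for work in candidates:
    verdict = "price it" if holds_past_month_six(work) else "case not built yet"
    print(f"{work.name}: {verdict}")
```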

What doesn't hold past month six

The mirror image of the three categories above is the three that operators keep trying to automate and that don't hold.

Customer support requiring judgement or empathy. Klarna ran this experiment in public. The cost saving was real; the customer-satisfaction drop was real; the revenue impact of the satisfaction drop overwhelmed the saving. The walk-back is now happening in mid-market quietly, on the same shape: agent that handles 75% of contacts in month one, NPS (net promoter score) drop visible by month four, board conversation in month six.

Novel-shape document handling. Any process where the input distribution is genuinely long-tailed (legal contract review across novel templates, custom proposal drafting, multi-step exception handling on insurance claims) sits in the 30-40% real-world failure-rate band cited in the Inovabeing 2026 reliability analysis. The savings exist in month one because the volume is there. They erode because the human review rate has to be 100% on the long-tail cases, not the headline 10%.

Anything where the failure cost exceeds the labour saved. The asymmetry kills the unit economics regardless of average performance. Tax submissions, regulated medical-record updates, financial-controls overrides, anything that touches a publicly visible customer commitment. A 98% accurate bot that gets the 2% wrong on a tax filing costs more than the 98% saved. The fix is not a better bot; it is a different scope choice.
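The asymmetry can be expressed as a per-item expected value. The sketch below assumes a simple model in which each correctly handled item saves a fixed amount of labour and each wrong item incurs a fixed failure cost. The function names and the GBP figures are hypothetical, chosen only to show the shape of the calculation.

```python
def net_saving_per_item(error_rate: float, saving_per_item: float,
                        failure_cost: float) -> float:
    """Expected saving per item once the cost of wrong answers is included."""
    return (1.0 - error_rate) * saving_per_item - error_rate * failure_cost


def break_even_failure_cost(error_rate: float, saving_per_item: float) -> float:
    """Failure cost per wrong item at which the expected saving falls to zero."""
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    return (1.0 - error_rate) / error_rate * saving_per_item


# Hypothetical figures: a 98%-accurate bot, GBP 20 of labour saved per item,
# GBP 5,000 of penalty and rework per wrong filing.
net = net_saving_per_item(error_rate=0.02, saving_per_item=20.0, failure_cost=5000.0)
limit = break_even_failure_cost(error_rate=0.02, saving_per_item=20.0)

print(f"net saving per item: {net:.2f}")
print(f"break-even failure cost: {limit:.2f}")
```

At a 2% error rate, the bot stops paying for itself once a single failure costs more than about 49 times the per-item saving, which is why scope, not accuracy, is the lever.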

The counter-thesis: who is landing it

This post's argument is not that AI automation categorically fails. The published evidence undercuts that strong form. BCG's AI Radar 2026 identifies a top quintile of organisations capturing real EBIT impact. McKinsey reports that organisations investing $25 million or more in responsible-AI initiatives see EBIT impact above 5%. These are not invented numbers. They reflect organisations that landed the savings.

What is structurally different about them? Three things keep appearing in the case studies. First, they buy specialised tools from vendors who own a specific use case, not generic productivity platforms. The MIT NANDA finding that vendor partnerships succeed about 67% of the time, against roughly one-third of that rate for internal builds, is the most consistent signal in the report. Second, they price the year-1 run-rate before the procurement signature, including the exception-handling headcount, the maintenance retainer, and the governance overhead. The board agreement is signed against that number, not the throughput demo. Third, the executive sponsor is a line manager with P&L accountability, not a central AI lab. BCG's trailblazer pattern puts the procurement decision in the hands of the operator whose numbers the AI is supposed to move.

A mid-market firm cannot match a FTSE 100's $25 million investment threshold. It does not have to. The same three disciplines apply at the lower scale: specialised vendor, year-1 run-rate pricing, line-manager sponsor. The size of the savings shrinks proportionally; the shape of the procurement that delivers them does not.

How to price year-1 run-rate, not month-1 throughput

The procurement discipline that survives the cliff is concrete. Five line items belong in the year-1 budget on any AI automation programme:

Licence or platform cost. The published number. UiPath in 2026 sits at roughly $4,000 per unattended bot per year on the entry plan, with mid-market deployments (10-50 bots) commonly in the $150,000-$500,000 annual range per Vendr's 2026 marketplace data; agentic AI agent pricing layers on top, typically $200-$600 per agent per month at platform tier. Quote the number for your scope.

Exception-handling staffing. Assume 10-15% of the throughput volume comes through the exception queue in month one, growing to 20-25% by month nine as the system encounters more edge cases. Staff to the month-nine number, not the month-one number. An exception handler costs roughly the same per hour as the labour the automation displaces, so the net saving is the displaced labour minus the exception-handler cost, not the gross figure.

Maintenance retainer. 20-30% of build cost annually, per the Kognitos 2025 industry benchmark. This is the line item that vendor proposals most often leave off the comparison sheet. Insist on it as a separate quoted item.

Retraining and model-update overhead. When the underlying model updates (Anthropic, OpenAI, Google all ship quarterly), behaviour shifts. The QA (quality assurance) cycle to re-validate against your existing test set is roughly two engineer-weeks per major update. Four updates a year is realistic; budget eight engineer-weeks of annualised AI-engineer time.

Governance and audit overhead. Under the EU AI Act high-risk obligations, expect a documented impact assessment, post-market monitoring, and an annual external review. The fixed compliance cost lands at roughly GBP 30,000-80,000 a year at mid-market scale, depending on how many AI systems you operate and how mature your existing data governance is.

Add the five together. Compare against the labour displaced at a realistic month-twelve straight-through rate, not the month-one demo rate. If the net savings still justifies the procurement, sign. If it doesn't, the right answer is a smaller scope on a category that holds, not a more ambitious scope on a category that won't.
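The five line items and the final comparison fit in a few lines of arithmetic. The sketch below is one way to lay it out, under stated assumptions: the class and function names are hypothetical, every monetary figure is an invented placeholder in GBP, and only the rates (a maintenance retainer of 20-30% of build cost, four model updates a year at two engineer-weeks each) come from the text. Replace the placeholders with your own quoted numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YearOneCosts:
    """The five year-1 line items, all in the same currency."""
    licence: float
    exception_handling: float
    maintenance_retainer: float
    retraining: float
    governance: float

    def total(self) -> float:
        return (self.licence + self.exception_handling
                + self.maintenance_retainer + self.retraining + self.governance)


def net_year_one_saving(labour_displaced: float, costs: YearOneCosts) -> float:
    # labour_displaced should be estimated at the month-twelve
    # straight-through rate, not the month-one demo rate.
    return labour_displaced - costs.total()


# Placeholder inputs (assumptions for illustration, not benchmarks).
build_cost = 200_000
engineer_week_cost = 2_500

costs = YearOneCosts(
    licence=100_000,
    exception_handling=60_000,              # staffed to the month-nine queue
    maintenance_retainer=0.25 * build_cost,  # within the 20-30% range cited
    retraining=4 * 2 * engineer_week_cost,   # four updates, two weeks each
    governance=50_000,                       # within the GBP 30,000-80,000 range
)

net = net_year_one_saving(labour_displaced=350_000, costs=costs)
print(f"year-1 cost: {costs.total():,.0f}")
print(f"net year-1 saving: {net:,.0f}")
```

If `net` is negative or marginal on your own figures, the calculation is pointing at scope, not at the vendor's accuracy claim.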

What to do this week

Three actions clear the fog before the next procurement conversation.

First, name the line item the AI is supposed to move. Not "admin cost," not "operational efficiency". A specific line on this quarter's management accounts. AR (accounts receivable) days, invoice-processing cost per unit, tier-one contact-centre cost per ticket, exception-rework hours per week. The line-item discipline is what we wrote about previously in why most enterprise AI pilots don't move the P&L, and the discipline travels.

Second, classify the work category. If the input shape is identical across transactions, the savings category is real; price it. If the input distribution is long-tailed and the failure cost is asymmetric, the procurement case has not been built yet. Better to know that before signing than after the board meeting in month seven.

Third, get the year-1 run-rate number on paper before the vendor signature. Five line items. Add them up. Compare to the month-twelve labour displaced, not the month-one throughput. If the vendor will not quote the maintenance retainer or the exception-handling assumption as a separate line, that is its own answer about whose risk the project is on.

AI automation savings are real in 2026 and they are landing somewhere. The question for a mid-market operator is whether the procurement on your desk lands in the three categories that hold past month six, or the three that don't. The published evidence is now thick enough that the answer is not a guess. Price the year-1 number. The rest is downstream.
