Appify Intelligence - AI Development & Automation Specialists
© Appify Digital 2026
AI Strategy

Why 95% of enterprise AI pilots don't move the P&L (and what shipping teams do)

The viral MIT finding has been widely misread. Three 2025 studies point to the same four disciplines that separate the teams shipping AI into production.

For: CTOs, Heads of Data, and operators running enterprise AI pilots

Appify Intelligence Team | 23 April 2026 | 7 minute read
Office desk with a blank notepad, pen on top, and a laptop to one side

In August 2025, a report from MIT's NANDA project produced the statistic that ate every enterprise AI conversation for a quarter: 95% of generative AI initiatives at large companies showed no measurable return on the bottom line. The number went viral in the shape every LLM headline takes - "enterprise AI fails 95% of the time" - and reached the desk of every operator who had a pilot in-flight.

The number is real. The reading is wrong. And the teams that do ship AI into production in 2025 and 2026 share a handful of disciplines that the viral reading skips past, because "you need better evals and a clearer use-case" is a less exciting headline than "AI doesn't work".

What the MIT figure actually says

The MIT NANDA report reviewed around 300 public enterprise AI initiatives, interviewed 52 organisations, and polled roughly 150 senior leaders at four industry conferences. Its headline finding is precise in a way that most coverage dropped on the way to publication: of the surveyed generative AI initiatives, about 5% produced rapid revenue acceleration and the rest produced no measurable P&L impact within six months post-pilot.

That is a very specific window. It measures deployment beyond pilot with direct, attributable profit-and-loss effect at month six. It does not count efficiency gains, cost reductions, churn improvements, or faster lead conversion - any of which are real returns that do not show up as a P&L line item that fast. The report itself acknowledges its interview data is only "directionally accurate".

The lead author told Fortune that the 5% who did land "pick one pain point, execute well, and partner smartly with companies who use their tools". That is the finding underneath the finding. The 5% share a set of organisational habits that any competent team can adopt. Model choice isn't what's holding the other 95% back.

The same shape shows up in three other studies

Since the MIT piece dropped, three large independent studies have published data that agrees on the shape of the gap.

McKinsey's State of AI 2025 reports 88% of organisations using AI in at least one business function, 39% seeing enterprise-level EBIT (earnings before interest and tax) impact, and only 6% qualifying as AI high performers (those capturing more than 5% EBIT uplift from AI). McKinsey's own conclusion: the high performers' advantage is "overwhelmingly organisational, not technological".

BCG's Build for the Future 2025, a survey of 1,250 senior executives across nine industries and 25-plus sectors, sorts firms into 5% "future-built", 35% "scalers", and 60% "laggards". Laggards report minimal revenue and cost gains from AI; future-built firms show 1.7x revenue growth and 1.6x EBIT margin relative to laggards.

IBM's 2025 CEO Study of 2,000 CEOs finds that only 25% of AI initiatives delivered their expected ROI, and just 16% have scaled enterprise-wide.

Across four studies, with different methods and different time windows, the pattern is the same. A small minority of firms captures most of the value from AI. Frontier models are a commodity to anyone with a cloud account, so whatever sets that minority apart sits elsewhere.

The four disciplines that separate shipping teams

Each of those four studies, plus a November 2025 HBR piece titled Most AI Initiatives Fail, names an overlapping cluster of disciplines among the top-tier teams. Four show up in every source.

Use-case selection

Top-tier teams concentrate AI spend on a narrow set of high-impact problems. They pick a use case that is expensive today, has a clear business metric attached, and has a named owner on the business side who can say when to stop. Teams stuck at POC (proof-of-concept) usually have the opposite: a portfolio of ten small experiments, no kill criteria, and no owner who will tell the exec team which five to cancel. MIT Sloan's own guidance on opportunity prioritisation says the same thing from the academic side.

Evaluation

Shipping teams measure what their AI system does in production, at the task level, with metrics that predate the AI system. The frontier labs make this case by example. Anthropic released Bloom in late 2025, an agentic framework for automated behavioural evaluations of frontier LLMs, and ran a joint cross-lab evaluation with OpenAI in mid-2025. If the labs that built the models think evaluation is the hard problem, it is the hard problem for anyone building on top of those models. A pilot with no evaluation loop is a pilot whose failure mode is invisible until a customer finds it.
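A task-level loop does not need a framework to exist. A minimal sketch, where `run_system` is a stand-in for whatever the pilot does and the golden set is hand-built from real historical tasks (every name and threshold here is illustrative):

```python
# A golden set of (input, expected) pairs scored against the production
# system itself, not against a vendor's benchmark numbers.

def run_system(text: str) -> str:
    # Placeholder for the pilot under evaluation; here it pretends
    # to classify support tickets with a crude keyword rule.
    return "refund" if "money back" in text.lower() else "other"

golden_set = [
    ("I want my money back", "refund"),
    ("Where is my order?", "other"),
    ("Please refund me, I want my MONEY BACK now", "refund"),
]

def evaluate(system, cases) -> float:
    # Task-level pass rate: the failure mode is visible before a customer finds it.
    passed = sum(1 for text, expected in cases if system(text) == expected)
    return passed / len(cases)

score = evaluate(run_system, golden_set)

# Gate deployment on a threshold the business owner agreed to in advance.
SHIP_THRESHOLD = 0.9
ready = score >= SHIP_THRESHOLD
```

The point of the sketch is the shape, not the keyword rule: the golden set predates the model, and the threshold predates the demo.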

Governance

Operational governance, to be clear. A named accountable person, a risk register that maps to NIST's AI Risk Management Framework, a human-in-the-loop checkpoint wherever the system can cause reputational, legal, or financial harm. Gartner's June 2025 forecast that more than 40% of agentic AI projects will be cancelled by 2027 cites three drivers: escalating costs, unclear business value, and inadequate risk controls. The third is a governance failure in plain language.
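The same checkpoint can live in code as well as in the register. A toy sketch of a human-in-the-loop gate, with risk tiers and routing that are illustrative only:

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"    # e.g. internal summarisation
    HIGH = "high"  # reputational, legal, or financial harm possible

def route(action: str, risk: Risk) -> str:
    # Anything that can cause harm is queued for a named human, every time;
    # the system never executes those actions on its own.
    if risk is Risk.HIGH:
        return f"review:{action}"
    return f"execute:{action}"

decision = route("issue customer refund", Risk.HIGH)
```

The gate is trivial; the discipline is that it exists at every point where the system touches money, contracts, or customers, and that a named person drains the review queue.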

Integration depth

The MIT report's coverage in Tom's Hardware singled out flawed integration as the root cause the report names most often. If a pilot sits next to the workflow it was supposed to change, you will never see it in the P&L. In practice, that means redesigning the process around the AI rather than slotting the AI next to a process that has not moved.

What this looks like in practice

At Appify Intelligence we argue this case from the stage the same way we argue it from the strategy deck. Our CEO Karl Mulligan closed a talk at The Fold in Leamington Spa on 21 April 2026 with the line: "Productivity is easier than profit. It's easy to do more with AI. It's much harder to make that extra productivity show up on the bottom line. Governance, human-in-the-loop, guardrails - these aren't buzzwords you spray at investors. They're the difference between AI that saves your business and AI that embarrasses it."

We run the same discipline on our own tooling. The internal pipeline we use to qualify Upwork leads scores every job post with a Claude-based rubric, ranks it 1 to 10, and passes the top tier to a two-step human-confirm flow before any proposal is sent. The AI never acts unsupervised. Every rejection feeds back into the scoring prompt, so the criteria evolve toward what we actually want rather than what we told the model to want on day one. That is a learning loop: evaluation, governance, and integration depth in one small system.
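Stripped to its shape, that loop looks like this. The scoring function below is a crude keyword stand-in for the Claude rubric call, the two-step confirm is collapsed into one callable, and every name and threshold is illustrative:

```python
def score_lead(post: str, rejected_examples: list[str]) -> int:
    # Stand-in for the Claude-based rubric: the real system prompts a model
    # with a scoring rubric; a keyword score keeps this sketch runnable.
    score = 5
    if "ai" in post.lower():
        score += 3
    if any(bad in post.lower() for bad in rejected_examples):
        score -= 4
    return max(1, min(10, score))

def qualify(posts: list[str], rejected_examples: list[str], confirm) -> list[str]:
    proposals = []
    for post in posts:
        if score_lead(post, rejected_examples) >= 8:  # top tier only
            if confirm(post):                         # human confirms before any proposal
                proposals.append(post)
            else:
                # Feedback loop: rejections reshape the scoring for next time.
                rejected_examples.append(post.lower())
    return proposals

rejected: list[str] = []
accepted = qualify(["Build an AI chatbot agent", "Fix my WordPress theme"],
                   rejected, confirm=lambda post: True)
```

The AI proposes, a human disposes, and every human "no" changes what the AI proposes next. That is evaluation, governance, and integration depth in about twenty lines of structure.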

Our post "RAG is not a search bar" walks through the same pattern inside a narrower problem: the fixes that matter are the boring ones, and most teams have never run them.

What to do this quarter

If you are operating a pilot that is "working" but not moving the P&L, the model question is rarely the useful one. Useful questions, in order:

  • What is the single business metric this pilot is supposed to move, and who on the business side owns it?
  • What is our evaluation loop at the task level, separate from any model vendor's marketing numbers?
  • What is the kill criterion, by date?
  • What workflow does this AI system live inside, and has that workflow been redesigned, or has the AI been pasted in?

If the first two are unclear, the discipline to invest in is use-case selection and evaluation, before anything else. If the last two are unclear, governance and integration depth.

None of these require waiting for a better model. All of them are why the 5% ship. If the description above reads like where your pilot is stuck, our AI consulting and AI automation teams are the ones who have this conversation with operators for a living. The first call is free, and we will tell you honestly whether the problem is the pilot, the workflow, or the use case.

Tagged

ai-strategy, ai-governance, enterprise-ai, evaluation

Ready to talk?

If this post maps to a problem you're hitting, we'd like to hear about it. We turn AI experiments into production systems.

Start a conversation

Related articles

Empty plenary chamber of the European Parliament in Strasbourg with EU flags at the podium

AI Governance

EU AI Act update: the SME fine cap, the Omnibus wobble, and Ireland's 15 regulators

A specialist update to our March piece. What actually ships on 2 August 2026, the Article 99(6) carve-out for SMEs that most coverage misses, and Ireland's real status.

Rows of server racks in a blue-lit data centre with cabling and status LEDs

AI Engineering

RAG is not a search bar: five mistakes we keep seeing in SME retrieval

Most SME RAG systems break at the boring edges, not the vector database. Five mistakes we keep finding in other teams' retrieval, and the cheap fixes.