What production AI evals look like: the shapes most teams skip
Most teams ship AI on a smoke-test eval and a benchmark dump. The eval shapes that catch real regressions, and where 'more evals' starts to hurt.
For: CTOs, engineering leads, and Heads of Data running production AI features

On 16 April 2026, Anthropic shipped a Claude Code system-prompt change alongside Opus 4.7. Their internal eval suite showed no regression. After users flagged behavioural shifts, the team ran ablations against a wider set of evals and found a 3% quality drop on Opus 4.6 and 4.7. Per the 23 April postmortem, the new policy is to run a broader per-model suite on every system-prompt change.
That is one of the better-instrumented engineering organisations in the industry telling on itself. The macro version, from MIT's State of AI in Business 2025, is bleaker: 95% of enterprise generative AI initiatives produce no measurable P&L impact, and one named cause is that those systems "do not retain feedback, adapt to context, or improve over time". Both findings point at the same gap. Production AI is hard to make reliable, and the discipline that closes the gap is evaluation.
This post is for engineering leads, CTOs, and Heads of Data with an AI feature live or about to ship. None of the eval shapes below are exotic. Most teams have one or two and call it done.
The shapes
Across the canonical eval frameworks - Anthropic's eval guide, Promptfoo, Braintrust, Langfuse, Vercel AI SDK, and the UK AI Security Institute's Inspect - the same handful of shapes recur. They do different jobs. Maintaining one without the others gives you a single channel of feedback, and a feature with a single channel of feedback fails at the seam.
1. Golden-set regression evals
The floor. A frozen set of inputs with known-correct outputs, run on every code or prompt change, with a target pass rate near 100%. Anthropic's engineering team draws the line plainly: capability or quality evals should start at a low pass rate; regression evals should hit nearly 100%.
The pattern is borrowed from regression testing in software, and the rules are the same. New failures fail the build. Old failures stay flagged. The set grows when you find a new bug and stays the size it grew to. For a RAG (retrieval-augmented generation) system this is precision@k and recall@10 against a frozen corpus, both of which we covered in our earlier piece. For a chat agent it is intent-classification accuracy on a labelled set. For a structured-extraction tool it is schema validity plus field-level F1. Whichever shape it takes, the test set is checked into the repo, versioned, and gated in CI before any deploy.
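To make the gate concrete, here is a minimal sketch of a golden-set runner that fails the build below a pass-rate floor. The file layout, the `run_model` stub, and exact-match grading are all assumptions; swap in field-level F1 or schema validation as your feature demands.

```python
# golden_gate.py - minimal golden-set regression gate (illustrative sketch).
# Assumes a JSONL file of {"input": ..., "expected": ...} records checked into
# the repo, and a run_model() stub standing in for your real pipeline.
import json
import sys
from pathlib import Path

GOLDEN_SET = Path("evals/golden_set.jsonl")  # versioned alongside the code
PASS_RATE_FLOOR = 0.98                       # regression evals target ~100%

def run_model(input_text: str) -> str:
    raise NotImplementedError("call your model / RAG pipeline here")

def main() -> int:
    cases = [json.loads(line) for line in GOLDEN_SET.read_text().splitlines() if line.strip()]
    failures = []
    for case in cases:
        output = run_model(case["input"])
        if output.strip() != case["expected"].strip():  # exact match; swap in F1 etc.
            failures.append(case.get("id", case["input"][:40]))
    pass_rate = 1 - len(failures) / len(cases)
    print(f"golden set: {pass_rate:.1%} pass ({len(failures)} failures)")
    for f in failures:
        print(f"  FAIL: {f}")
    return 0 if pass_rate >= PASS_RATE_FLOOR else 1  # non-zero exit fails the CI build

if __name__ == "__main__":
    sys.exit(main())
```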
2. LLM-judge evals with calibrated rubrics
Code grading covers the cases where "right" is decidable. Most production AI behaviour falls outside that. Was the summary faithful to the source? Did the chatbot respect the brand voice? Did the agent escalate to a human when it should have? Those are judgement calls, and a frozen test set of golden answers does not capture them well.
The standard solution is an LLM-as-judge: a strong model (often a different family from the one being evaluated, to dodge self-preference bias) scoring outputs against a written rubric. Zheng et al. (2023) showed that GPT-4 judges agreed with human raters about 80% of the time on open-ended tasks - "the same level of agreement between humans". That number is the entire reason the technique works.
The work is in the rubric. Anthropic's rubric guidance is concrete: detailed, empirical, structured output, and "ask the LLM to think first before deciding an evaluation score, and then discard the reasoning". A 1-5 Likert rubric with explicit examples beats "is this good?" every time. The output schema is constrained to a single token or small integer. The reasoning is generated, parsed, and thrown away.
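A sketch of what that looks like in code, assuming an OpenAI-compatible chat endpoint; the rubric text, model choice, and score format are illustrative, not Anthropic's:

```python
# judge.py - LLM-as-judge sketch: constrained score, reasoning generated then discarded.
import re
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the answer for faithfulness to the source on a 1-5 scale.
5: every claim is supported by the source.
3: minor unsupported details; core claims supported.
1: contradicts the source or invents facts.
Reason step by step first, then end with exactly one line: SCORE: <1-5>"""

def judge(source: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",  # a different family from the model under test, to dodge self-preference
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"SOURCE:\n{source}\n\nANSWER:\n{answer}"},
        ],
    )
    text = resp.choices[0].message.content or ""
    match = re.search(r"SCORE:\s*([1-5])\s*$", text)
    if match is None:
        raise ValueError(f"judge returned no parseable score: {text[-80:]!r}")
    return int(match.group(1))  # the reasoning above the score line is thrown away
```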
3. Human-eval sampling that audits the judge
The judge is software too. It has bugs, and three of them are old enough to have names. Position bias is the largest: a study across 15 judges and 22 tasks (150,000 evaluations in total) found that position bias "is not due to random chance and varies significantly across judges and tasks... strongly affected by the quality gap between solutions". Mitigations are order randomisation and pairwise voting across both orderings, as sketched below. Verbosity bias rewards longer answers. Self-enhancement bias rewards a model's own outputs. The 2025 EMNLP paper Rating Roulette added self-inconsistency to the list: a judge will disagree with itself across runs.
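A sketch of the swapped-order mitigation, with `pairwise_judge` as a placeholder for a judge call like the one above:

```python
# Position-bias mitigation: judge each pair twice with the order swapped,
# and accept a winner only when both runs agree.
def pairwise_judge(prompt: str, first: str, second: str) -> str:
    raise NotImplementedError("LLM call returning 'first' or 'second'")

def debiased_verdict(prompt: str, a: str, b: str) -> str:
    verdict_ab = pairwise_judge(prompt, a, b)  # A shown first
    verdict_ba = pairwise_judge(prompt, b, a)  # B shown first
    if verdict_ab == "first" and verdict_ba == "second":
        return "a"
    if verdict_ab == "second" and verdict_ba == "first":
        return "b"
    return "tie"  # the verdict flipped with the order: position bias, not signal
```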
The audit is a small human-graded set sampled from production. Promptfoo's LLM-as-judge guide writes the loop down: "refine rubric wording until agreement is >90%... run against holdout examples (that you never tuned on) to check for overfitting." That target is the whole credibility of the system. Without it, the judge's score is a confident vibe.
In practice the loop runs weekly. Sample 50-100 production traces. Two reviewers grade independently. Compute judge-human agreement on the same instances. If agreement falls below 80%, the rubric needs sharper wording. Swapping the judge model rarely does what people expect.
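The agreement number itself is a few lines of code. A sketch with illustrative scores; Cohen's kappa is worth computing alongside raw agreement because it discounts the agreement two raters would reach by chance:

```python
# agreement.py - weekly judge-vs-human audit on the same sampled traces.
from collections import Counter

def raw_agreement(judge: list[int], human: list[int]) -> float:
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def cohens_kappa(judge: list[int], human: list[int]) -> float:
    n = len(judge)
    observed = raw_agreement(judge, human)
    # chance agreement: probability both raters independently pick the same label
    pj, ph = Counter(judge), Counter(human)
    expected = sum((pj[k] / n) * (ph[k] / n) for k in set(judge) | set(human))
    return (observed - expected) / (1 - expected)

judge_scores = [5, 4, 4, 2, 5, 3, 1, 4]  # illustrative data only
human_scores = [5, 4, 3, 2, 5, 3, 2, 4]
print(f"raw agreement: {raw_agreement(judge_scores, human_scores):.0%}")  # 75%
print(f"Cohen's kappa: {cohens_kappa(judge_scores, human_scores):.2f}")   # 0.68
```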
The honest part is that this is the shape teams cut first, because it is the only one that needs a human on the team. A domain expert grading 100 traces a week is not "one calendar block"; it is a recurring commitment, and somebody has to own the rubric. Most failures of judge-eval programmes are organisational, not algorithmic - the rubric drifts, the labellers rotate, and the agreement number stops getting computed. Naming an owner with a weekly cadence is the entire intervention.
4. Production telemetry feeding the eval set
Synthetic evals miss what the team did not predict. Anthropic's engineering post is direct: "[production monitoring] reveals real user behavior at scale and catches issues that synthetic evals miss... [but it] remains reactive; problems reach users before you know about them". The fix is a loop. Production traces flow into a labelling queue, the worst-graded ones get added to the regression set, and the regression set then catches that class of failure on the next change.
Braintrust calls this the production-feedback stage: "pull interesting production traces into datasets to improve offline test coverage". Langfuse runs LLM-judges directly on filtered production traffic, asynchronously, so the cost stays off the request path. The ergonomics matter. If a thumbs-down in your product UI does not land a labelled instance in the queue within a few minutes, the loop is decorative.
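A sketch of the hook, with a JSONL file standing in for whatever labelling queue your observability stack provides; the field names are illustrative:

```python
# feedback_hook.py - thumbs-down -> labelling queue, off the request path.
import json
import time
from pathlib import Path

LABEL_QUEUE = Path("evals/label_queue.jsonl")

def on_user_feedback(trace_id: str, user_input: str, model_output: str, rating: str) -> None:
    """Called from the product UI's feedback handler."""
    if rating != "thumbs_down":
        return
    record = {
        "trace_id": trace_id,
        "input": user_input,
        "output": model_output,
        "ts": time.time(),
        "status": "needs_label",  # a human grades it; the worst join the regression set
    }
    with LABEL_QUEUE.open("a") as f:
        f.write(json.dumps(record) + "\n")
```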
Why teams skip them
Each shape has a specific reason it gets cut. Naming the reason is the start of fixing it.
Golden-set evals look slow to write. They are slow the first time. The cost amortises across every future change. A team that cannot put a regression test on the bug it found yesterday will find that bug again next quarter.
LLM-judge rubric work feels endless. Anthropic's docs concede the point and advise teams to "prioritise volume over quality": more questions with slightly-lower-signal automated grading beat fewer questions with high-quality human hand-grading. Ship a thin v1 rubric and tighten it.
Human sampling needs an owner. See section 3. The fix is a calendar block and a name on it.
Production telemetry sounds like it touches the request path. It does not have to: run the judges asynchronously, as Langfuse does, and the cost stays off it. Most observability vendors now ship eval hooks built in; the remaining work is plumbing, a feature flag away.
Where more evals start to hurt
Two failure modes are worth naming.
Goodhart's Law applies. Once an eval becomes the target, the system gets optimised for the eval more than for the underlying behaviour. Chatbot Arena lived through this in 2024-25: vendors tuned for the leaderboard until the leaderboard slowly stopped predicting model quality. Mitigations are a rotating holdout that no one optimises against, plus the production-feedback loop above. If your team has only one eval suite and it gets tuned against, you have built a benchmark and called it a product.
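One way to implement the rotating holdout is to derive membership deterministically from the case ID and the current period, so nobody can tune against next month's split. A sketch; the 20% fraction and monthly rotation are assumptions:

```python
# rotating_holdout.py - holdout membership that rotates monthly.
import hashlib
from datetime import date

HOLDOUT_FRACTION = 0.2  # assumption: hold out 20% of cases each period

def in_holdout(case_id: str, today: date | None = None) -> bool:
    period = (today or date.today()).strftime("%Y-%m")  # rotate monthly
    digest = hashlib.sha256(f"{case_id}:{period}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF < HOLDOUT_FRACTION

# Tune prompts only where in_holdout(case_id) is False;
# report the holdout score, never optimise against it.
```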
Benchmark scores also do a poor job of predicting production fit. Stanford's AI Index 2026 puts it plainly: "Knowing that a benchmark for legal reasoning has 75 percent accuracy tells us little about how well it would fit in a law practice's activities." Public benchmark scores are lossy compression. Your evals need to look like your product. The leaderboard is a different question.
The discipline is the differentiator
McKinsey's State of AI 2025 identifies the same pattern at the population level. High performers are "more likely than others to say their organizations have defined processes to determine how and when model outputs need human validation", and run "human-in-the-loop rules, rigorous output validation, centralized AI governance". The practice is unglamorous and the work is repetitive. That is also why most teams do not do it.
If you have an AI feature in production and you are unsure which of these shapes you are missing, that is the audit. We do these. The first conversation is free, and we will tell you honestly whether the gap is worth closing. Get in touch.