Agents in production: the control surface is the product
Most production AI agents fail on the layer around the model. Tool design, the loop, the step budget, the context shape: where the engineering goes.
For: CTOs, engineering leads, and Heads of AI/Data running production agent features

Run the same agent against the same task eight times. On Sierra's tau-bench retail benchmark (June 2024), GPT-4o solves the task under 50% of the time on a single run and succeeds on all eight runs only around 25% of the time. Same agent, same inputs, and the success rate roughly halves the moment you demand consistency. The pass^k metric Sierra introduced (the probability the agent succeeds across k consecutive runs) is one of the few that captures that shape.
That is what production looks like for most teams shipping agents in 2026. The agent is not broken; it is non-evaluable. And an agent you cannot evaluate has no path to either reliability or sustainable cost.
The shape of the failure is rarely the model. It is the layer around the model: the tools, the loop, the step budget, the context shape, the deterministic checkpoints between steps. Together those make up the agent's control surface. Most teams treat that layer as scaffolding around the real work. It is the work.
This post is for engineering leads, CTOs, and Heads of AI building agents into a product. None of what follows is exotic. The discipline is boring. That is most of why teams skip it.
What the control surface is
Simon Willison's working definition, which Anthropic has aligned with, frames an LLM (large language model) agent as "tools in a loop". That resolves the whole system into two observable pieces: the loop and the tools. If either piece is non-deterministic, the agent is non-evaluable by construction.
Anthropic's Building Effective Agents (Dec 2024) tells teams to give tool definitions "just as much prompt engineering attention as your overall prompts" and to bake stopping conditions into the loop "such as a maximum number of iterations". Their Effective Context Engineering post (Sep 2025) goes further. They renamed the discipline. Prompt engineering became context engineering: "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference". The headline failure mode in that post is bloated tool sets that "cover too much functionality or lead to ambiguous decision points".
The lever named by the people who train these models is the tool surface and the context shape. The model is doing fine.
A third-party reverse engineering of Claude Code's architecture puts a number on it: roughly 1.6% of the codebase is AI decision logic and 98.4% is operational infrastructure (permission gates, context management, tool routing, recovery logic). Take the split as approximate. The point is the proportion. The agent loop is a while-loop. The engineering is everything around it.
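To make that proportion concrete, here is a minimal sketch of the loop itself, in Python; `call_model` and `run_tool` are placeholders for whatever model client and tool registry you actually use. Everything the rest of this post discusses lives around these few lines, not inside them.

```python
# Minimal "tools in a loop" skeleton. call_model and run_tool are
# placeholders for your model client and tool registry. The sections
# below are about the code that wraps this loop, not the loop itself.
def run_agent(task: str, call_model, run_tool, max_steps: int = 20) -> dict:
    messages = [{"role": "user", "content": task}]
    for step in range(max_steps):                   # step budget (see below)
        reply = call_model(messages)                # the non-deterministic part
        if reply.get("tool_call") is None:          # model says it is done
            return {"status": "done", "answer": reply["content"], "steps": step + 1}
        result = run_tool(reply["tool_call"])       # the deterministic part
        messages.append({"role": "tool", "content": result})
    return {"status": "budget_exhausted", "trace": messages}
```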
The four parts of the control surface
Tool design
Tools are the agent's API to the world. They have type signatures, error semantics, and documentation, and the model reads all three. Anthropic's specific failure mode is naming. Tools whose intended use overlaps confuse the model on every call. The fix is the obvious one. Self-contained tools, a clear naming convention, error messages the model can act on, and a small enough surface that the model holds it in working memory.
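As a sketch of what that looks like in code (the tool name, schema layout, and error shape here are illustrative, not tied to any particular SDK): one self-contained tool with an unambiguous name, typed parameters, and an error the model can act on.

```python
# One self-contained tool: unambiguous name, typed parameters, and an
# error message the model can act on rather than a bare stack trace.
# Names and schema layout are illustrative, not tied to a specific SDK.
SEARCH_ORDERS_TOOL = {
    "name": "search_orders_by_customer_email",      # says exactly what it does
    "description": "Look up a customer's orders by email address. "
                   "Returns at most the 10 most recent orders.",
    "parameters": {
        "type": "object",
        "properties": {
            "email":  {"type": "string", "description": "Customer email address."},
            "status": {"type": "string", "enum": ["open", "shipped", "refunded"],
                       "description": "Optional filter on order status."},
        },
        "required": ["email"],
    },
}

def search_orders_by_customer_email(email: str, status: str | None = None) -> dict:
    if "@" not in email:
        # Actionable error: what went wrong, and what the model should do next.
        return {"error": "invalid_email",
                "hint": "Ask the user to confirm their email address, then retry."}
    return {"orders": []}  # stub: the real lookup goes here
```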
Anthropic's own multi-agent research system caught this in their telemetry. Early agents "spawned 50 subagents for simple queries" and "scoured the web endlessly for nonexistent sources". The cause sat in the tool descriptions. Sharper docs cut the failure mode.
Step budget
Open-loop agents run until they decide they are done. Most production teams discover what "done" means to the agent the hard way. A maximum step count, a wall-clock budget, and a per-step cost ceiling are the floor. Cognition's Devin annual review reports a year-on-year merge rate moving from 34% to 67%. The improvements they cite are about scoping and clarity rather than model upgrades.
The harder question is what happens when the budget hits zero. A useful agent gives back a partial result, a clear failure mode, and the trace of what it tried. A useless agent loops or returns silence.
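A minimal sketch of that floor, assuming the model client reports (or you estimate) a per-call cost; the helper names are placeholders. The part worth copying is the return shape when the budget runs out: an explicit status, the partial result, and the trace.

```python
import time

# Hard budgets around the loop: step count, wall clock, and spend.
# call_model / run_tool are placeholders; cost_usd is assumed to be
# reported or estimated per model call.
def run_with_budget(task: str, call_model, run_tool,
                    max_steps: int = 15, max_seconds: int = 120,
                    max_cost_usd: float = 0.50) -> dict:
    messages = [{"role": "user", "content": task}]
    started, spent = time.monotonic(), 0.0
    for step in range(max_steps):
        if time.monotonic() - started > max_seconds or spent > max_cost_usd:
            break
        reply = call_model(messages)
        spent += reply.get("cost_usd", 0.0)
        if reply.get("tool_call") is None:
            return {"status": "done", "answer": reply["content"],
                    "steps": step + 1, "cost_usd": round(spent, 4)}
        messages.append({"role": "tool", "content": run_tool(reply["tool_call"])})
    # Budget exhausted: hand back what the agent has, not silence.
    return {"status": "budget_exhausted", "partial": messages[-1]["content"],
            "trace": messages, "cost_usd": round(spent, 4)}
```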
Context shape
Chroma's Context Rot research (Jul 2025) tested 18 frontier models and found performance "increasingly unreliable as input length grows". Even on minimal needle-in-a-haystack tasks, models degrade well below their nominal context windows. A 200K-token model reliably uses far less than 200K tokens.
That has a direct implication for production agents. The default move when an agent struggles is to feed it more context. The data says that direction degrades performance. Anthropic's Claude Code best-practices guide is explicit: "Most best practices are based on one constraint: Claude's context window fills up fast, and performance degrades as it fills." The work is curation. What does the next step actually need? What can be summarised, pruned, or moved out of the prompt entirely?
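One way to enforce that, as a sketch: a token budget checked before every model call, keeping the original task and the most recent steps verbatim and compressing everything in between. The four-characters-per-token estimate and the default summariser are crude stand-ins for whatever tokenizer and summarisation you actually run.

```python
# Curate context before each step instead of letting it grow unbounded.
# approx_tokens and the default summariser are crude stand-ins.
def approx_tokens(text: str) -> int:
    return len(text) // 4                      # rough estimate, ~4 chars per token

def curate(messages: list[dict], budget_tokens: int = 8_000, keep_recent: int = 6,
           summarise=lambda parts: " | ".join(parts)[:2_000]) -> list[dict]:
    total = sum(approx_tokens(m["content"]) for m in messages)
    if total <= budget_tokens or len(messages) <= keep_recent + 1:
        return messages
    head, middle, tail = messages[:1], messages[1:-keep_recent], messages[-keep_recent:]
    # Keep the original task and the latest tool results verbatim;
    # replace the middle of the history with a compact summary.
    summary = {"role": "system",
               "content": "Earlier steps (summarised): "
                          + summarise([m["content"] for m in middle])}
    return head + [summary] + tail
```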
Deterministic guardrails
The model is non-deterministic. The system around it does not have to be. Anthropic distinguishes hooks, which "run scripts automatically at specific points in Claude's workflow", from CLAUDE.md instructions, which are advisory: "hooks are deterministic and guarantee the action happens". That is the principle.
In a production agent, the deterministic layer covers permission checks before any destructive tool call, schema validation on every tool input and output, retry logic with explicit ceilings, and a verification step before any user-visible result ships. Anthropic's own line is sharper. "If you can't verify it, don't ship it."
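A sketch of those gates wrapped around a single tool call; the allow-list, the validators, and the verification step are stand-ins for whatever policy you actually enforce. None of them ask the model for permission.

```python
# Deterministic gates around one tool call: permission check, schema
# validation, a bounded retry ceiling, and verification before anything
# ships. ALLOWED_TOOLS, validate_args, and verify_result are policy stand-ins.
ALLOWED_TOOLS = {"search_orders_by_customer_email"}   # destructive tools stay off this list

class GuardrailError(Exception):
    pass

def guarded_call(tool_name: str, args: dict, run_tool,
                 validate_args, verify_result, max_retries: int = 2):
    if tool_name not in ALLOWED_TOOLS:                 # permission gate, before anything runs
        raise GuardrailError(f"tool {tool_name!r} is not permitted")
    validate_args(tool_name, args)                     # schema check; raises on bad input
    last_error = None
    for attempt in range(max_retries + 1):             # explicit retry ceiling
        try:
            result = run_tool(tool_name, args)
            if not verify_result(tool_name, result):   # if you can't verify it, don't ship it
                raise GuardrailError("result failed verification")
            return result
        except GuardrailError:
            raise                                       # never retry past failed verification
        except Exception as exc:                        # transient tool failure: retry, bounded
            last_error = exc
    raise GuardrailError(f"{tool_name} failed after {max_retries + 1} attempts") from last_error
```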
The diagnostic
Hamel Husain's failure-mode matrix is the cleanest practical tool for finding which part of the control surface is the problem. Take a sample of failing traces. Map each one to the first upstream failure: tool choice, parameter extraction, error handling, context retention, goal checkpoint. Score components separately. The component with the highest first-failure rate is where the engineering goes.
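A minimal version of that tally, assuming each failing trace has already been labelled with its first upstream failure; the category names follow the list above.

```python
from collections import Counter

# First-failure tally over a sample of failing traces. Each trace is
# assumed to carry a label naming the first upstream failure; the
# component with the highest rate is where the engineering goes.
COMPONENTS = ["tool_choice", "parameter_extraction", "error_handling",
              "context_retention", "goal_checkpoint"]

def first_failure_rates(labelled_traces: list[dict]) -> dict[str, float]:
    counts = Counter(t["first_failure"] for t in labelled_traces)
    n = len(labelled_traces) or 1
    return {c: counts.get(c, 0) / n for c in COMPONENTS}

# Example: 3 of 5 sampled failures break first at parameter extraction.
sample = ([{"first_failure": "parameter_extraction"}] * 3
          + [{"first_failure": "tool_choice"}, {"first_failure": "context_retention"}])
rates = first_failure_rates(sample)
print(max(rates, key=rates.get))                       # -> parameter_extraction
```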
This is the work most teams skip. They look at output quality, find it disappointing, and reach for either a bigger model or a bigger prompt. Both leave the control surface untouched. Both are expensive.
Cost is downstream of the control surface
There is a tempting counter-argument: that the binding constraint is cost, not control. Anthropic's multi-agent research post reports that agents use "about 4x more tokens than chat interactions" and multi-agent systems "about 15x more tokens". That is the unit-economics wall, and it does not wait for the determinism wall.
The framing here is that the two are the same wall. An agent with a bloated tool surface burns more context per step. An agent with a runaway step budget compounds that burn linearly. An agent without verification spends tokens on results that ship and need rework. The cost levers that move the bill the most - caching, retrieval, tier routing - all assume a control surface stable enough to apply them.
LangChain's State of AI Agents survey (Dec 2025, n=1340) backs this up at the population level. The top barriers to production agents are quality (32%) and latency (20%). Cost is not in the top tier. Quality and latency are both control-surface problems before they are model problems.
What to demand
If you are an engineering lead with an agent feature live or about to ship, the diagnostic is short. Pull a week of traces. Run the failure-mode matrix on the bottom decile. Count how many failures land in tool design, step budget, context shape, or guardrails. If the answer is more than 60%, the model is fine. The product is the control surface.
If you are evaluating a vendor pitch, the question is the same. Ask to see the tool definitions. Ask what the step budget is. Ask what happens at the budget ceiling. Ask for the verification step. Demos that hand-wave any of those are demos with no production answer.
We build agents inside AI consulting and AI automation engagements where the operator has already been burned by something that demoed well and shipped badly. The conversation we have is about the layer around the model. That is where the work lives.