AI agent for back-office operations: topology, observability, fallbacks
What a back-office AI agent actually looks like in production: the node topology, the typed messages between them, the fallback paths, and the tracing.
For: Founders, COOs and Heads of Operations scoping a back-office AI agent before signing or building

Gartner expects over 40% of agentic AI projects to be cancelled by the end of 2027, citing "escalating costs, unclear business value or inadequate risk controls". Their other line is sharper. Of the thousands of vendors selling agentic AI today, Gartner estimates "only about 130" are doing anything that deserves the label. The rest is "agent washing" - rebranded RPA (robotic process automation), chatbots, and assistants with a new marketing layer.
If you run a business with a back office and a vendor has pitched you an agent in the last six months, you have probably met some of those 130 and a lot of the rest.
This post is for the founder, COO or Head of Operations doing the buying. The argument is narrow. A back-office agent that survives production is a topology - named nodes, typed messages between them, a deterministic fallback path, and a tracer attached on day one. Capability is the easy part. The topology is what most pitches will not show you, and it is the part that decides whether the agent ships, gets pulled, or turns into a line item in the next audit.
We have written separately about the agent control surface for the engineering team building the agent. This post is for the operator buying or scoping one; where they overlap, that post goes deeper on the loop and this one goes deeper on the back-office shape.
What "back office" means here
Back-office work in mid-market firms looks similar across sectors: procure-to-pay, record-to-report, ticket triage, invoice matching, AP (accounts payable) exception handling, employee onboarding, IT access requests, vendor onboarding, contract intake. McKinsey's State of AI 2025 names the highest-potential targets directly: "Finance and planning areas like procure-to-pay, record-to-report, and forecast-to-plan show the greatest potential for agentic AI."
Three things make this work a good fit for agents and a bad fit for naive automation. Inputs are messy (PDFs, emails, free-text tickets). Decisions are bounded (codes, GL (general ledger) lines, approval routes) without being deterministic. An error is costly but recoverable if caught early.
What does not fit: anything load-bearing on safety, anything where a wrong answer cannot be unwound, anything the regulator wants a named human signing off. Those workflows need humans assisted by an agent rather than the other way around.
The topology
The shape that works is a directed graph. Every back-office agent we have built or audited reduces to a small number of named nodes, an explicit state object passed between them, and one or two exit edges per node. LangGraph's documentation names the primitives plainly: nodes are functions doing the work, edges route control between them, and a checkpointer snapshots the graph state per thread so the run can pause at a node, surface to a reviewer, and resume from that exact point without recomputing the steps before it. The frameworks differ on syntax. The shape is universal.
For an invoice-processing agent, the production topology is roughly:
ingest -> classify -> extract -> validate -> match -> [decide]
                                                      /       \
                                                    act    human_review
                                                                |
                                                             resume
Six functional nodes plus a human-review interrupt. Each one has a single responsibility, a typed input, a typed output, and a defined failure mode.
ingest: reads the source (email attachment, supplier portal pull, SFTP drop). Output: RawDocument (id, source, mime, bytes).
classify: decides what the document is (invoice, credit note, statement, dunning letter, junk). Output: ClassifiedDocument with a confidence score. If confidence is below threshold the case routes to human_review instead of going on to the next node. Anthropic calls this pattern routing: it "classifies an input and directs it to a specialized followup task".
extract: pulls structured fields (supplier, PO reference, line items, totals, VAT). Output: ExtractedInvoice plus a per-field confidence map. The map is load-bearing. Total confidence is a number you cannot act on; per-field confidence is the thing that decides which fields the next node can trust.
validate: deterministic, no LLM. Schema checks, arithmetic checks (line items sum to total), VAT rate plausibility, duplicate-invoice check against the last 90 days. This node fails closed - any failure goes to human_review with the offending field flagged.
match: looks up the PO (purchase order), the supplier record, the GL coding rules. Output: MatchedInvoice with a routing decision (auto-post, route to approver, hold for human).
act: writes to the ERP (enterprise resource planning system). The only node with side effects. Single retry on transient failure, escalate to human_review on permanent failure or any 4xx.
human_review: the LangGraph-style interrupt. The graph pauses, saves state, and surfaces the case in a review queue with the full trace attached. A human resolves the case and the graph resumes from the interrupted node rather than restarting from ingest.
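Wired up in LangGraph, the whole topology is under fifty lines. What follows is a sketch, not a drop-in implementation: the node bodies are stubs, the state fields are illustrative, and a real build would branch human_review on the reviewer's decision.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class InvoiceState(TypedDict, total=False):
    raw_document: dict
    classified: dict        # classify output, carries a confidence score
    extracted: dict         # extract output, carries a per-field confidence map
    validation_errors: list
    routing_decision: str   # "auto_post" | "route_approver" | "hold_for_human"

# Stub node bodies. Real nodes call the model or the ERP and return a
# typed partial update that LangGraph merges into the state.
def ingest(state):       return {"raw_document": {"id": "inv-001"}}
def classify(state):     return {"classified": {"type": "invoice", "confidence": 0.97}}
def extract(state):      return {"extracted": {"total": 1200.0}}
def validate(state):     return {"validation_errors": []}
def match(state):        return {"routing_decision": "auto_post"}
def act(state):          return {}   # the only node with side effects
def human_review(state): return {}   # in production this pauses via interrupt()

def route_decision(state: InvoiceState) -> str:
    # The [decide] edge in the diagram: any doubt routes toward the human.
    if state["validation_errors"] or state["routing_decision"] == "hold_for_human":
        return "human_review"
    return "act"

builder = StateGraph(InvoiceState)
for name, fn in [("ingest", ingest), ("classify", classify), ("extract", extract),
                 ("validate", validate), ("match", match), ("act", act),
                 ("human_review", human_review)]:
    builder.add_node(name, fn)

builder.add_edge(START, "ingest")
builder.add_edge("ingest", "classify")
builder.add_edge("classify", "extract")
builder.add_edge("extract", "validate")
builder.add_edge("validate", "match")
builder.add_conditional_edges("match", route_decision, ["act", "human_review"])
builder.add_edge("human_review", "act")  # real builds branch on the reviewer's decision
builder.add_edge("act", END)

# The checkpointer snapshots state per thread - this is what lets a run
# pause at human_review and resume without recomputing earlier nodes.
graph = builder.compile(checkpointer=MemorySaver())
result = graph.invoke({}, {"configurable": {"thread_id": "inv-001"}})
```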
Naming changes by domain. The shape rarely does. A ticket-triage agent runs classify -> route -> draft -> approve -> send. Five to seven nodes, a typed state, a deterministic guardrail node, and a human-review interrupt are constant.
Typed messages between nodes
The state object passed between nodes is the contract. A free-form dictionary that any node can write to gives you a chat. A typed record with explicit ownership of every field gives you a system.
We use Zod or Pydantic schemas for the state, the per-node input, and the per-node output. A node that returns malformed state fails closed. A node that wants to add a field declares it in the schema first. The point is observability as much as correctness. When the agent does something unexpected at 02:00 on a Sunday, the trace has to tell you which node wrote which field with what confidence. Free-text state hides exactly the information you need.
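A minimal sketch of that contract in Pydantic - the field names are illustrative, the fail-closed behaviour is the point:

```python
from pydantic import BaseModel, Field, ValidationError

class ExtractedInvoice(BaseModel):
    supplier: str
    po_reference: str | None = None
    line_items: list[dict]
    total: float
    # Per-field confidence is the load-bearing part: the next node decides
    # field by field what it can trust, not from one aggregate number.
    field_confidence: dict[str, float] = Field(default_factory=dict)
    written_by: str = "extract"   # the trace answers "which node wrote this?"

def parse_extract_output(raw_output: dict) -> ExtractedInvoice:
    try:
        return ExtractedInvoice.model_validate(raw_output)
    except ValidationError as err:
        # Fail closed: malformed state stops here and, in the graph,
        # routes to human_review instead of flowing to the next node.
        raise RuntimeError(f"extract returned malformed state: {err}") from err
```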
Anthropic's own multi-agent system hit this. Their early agents "spawned 50 subagents for simple queries" and "scoured the web endlessly for nonexistent sources" because the tool descriptions overlapped and the agents kept guessing. The fix was sharper tool definitions and tighter typing of what each agent could ask for and return. Their token-usage note matters here too: token usage by itself "explains 80% of the variance" in agent performance, and a typed schema is the cheapest way to keep an agent from re-asking its own questions.
Observability is not optional
The LangChain State of AI Agents 2025 survey (n=1340) found 89% of teams have implemented some form of observability for their agents, against 52% running offline evaluations - and for teams with agents already in production the observability number rises to 94%. The number is unusually consistent with what we see in the field. Teams figure out fast that an agent without traces is unreviewable.
The observability stack that works for back-office agents has three layers.
The first is span-level tracing. Each LLM call, each tool call, each retrieval is a span with input, output, latency, token count, and cost attached. The OpenTelemetry semantic conventions for LLMs - Arize's OpenInference is the most widely adopted - mean a trace from LangSmith, Langfuse, Helicone, Arize AX, or Phoenix shares a schema. Pick one vendor; switch later without rewriting instrumentation.
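A sketch of a single span in plain OpenTelemetry. The llm.* attribute names are in the spirit of those conventions but should be checked against whichever one you adopt, and call_model is a stand-in for the real model call:

```python
from opentelemetry import trace

tracer = trace.get_tracer("invoice-agent")

def call_model(text: str) -> dict:
    # Stand-in for the real LLM call; returns what the span needs to record.
    return {"model": "model-x", "label": "invoice",
            "prompt_tokens": 812, "completion_tokens": 9, "cost_usd": 0.0031}

def classify_with_span(document_text: str) -> dict:
    with tracer.start_as_current_span("classify.llm_call") as span:
        result = call_model(document_text)
        span.set_attribute("llm.model_name", result["model"])
        span.set_attribute("llm.token_count.prompt", result["prompt_tokens"])
        span.set_attribute("llm.token_count.completion", result["completion_tokens"])
        span.set_attribute("agent.cost_usd", result["cost_usd"])  # our own attribute
        return result
```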
The second is structured run-level state. The full state object at each checkpoint, queryable by run id. When a finance reviewer asks "why did this go to human review?", the answer should come from the validate node's output rather than from a reconstructed chat transcript.
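With a LangGraph-style checkpointer, that answer is a direct query against the thread. A sketch, assuming the compiled graph from the topology example above:

```python
# Pull the checkpointed state for one run and read the verdict off it.
config = {"configurable": {"thread_id": "inv-001"}}
snapshot = graph.get_state(config)

print(snapshot.next)                                   # e.g. ("human_review",)
print(snapshot.values.get("validation_errors"))        # the validate node's output
print(snapshot.values.get("extracted", {}).get("field_confidence"))
```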
The third is a per-domain dashboard the operator can read without engineering help. Throughput per node, auto-post rate, human-review rate by reason, time-to-resolution on flagged cases, cost per processed invoice. The dashboard tells the COO whether the agent is improving or drifting. None of the agent vendors we have audited ship this layer by default. It is the cheap thing to build and the expensive thing to skip.
The fallback path is the product
Most agent pitches focus on the happy path. The fallback path is where the product is.
Anthropic's Building Effective Agents names the same primitive: "Agents can then pause for human feedback at checkpoints or when encountering blockers." Pausing before an irreversible action - posting an invoice, approving a payment, deleting a record - is the cheapest way to keep an LLM mistake from becoming an audit finding. Four named patterns we use across builds:
Confidence-routed escalation
Every classification and extraction node emits a confidence score. The graph has a single configurable threshold per node. Below threshold routes to human review. The threshold is logged and tunable from the dashboard. It should never sit buried in a config file the operator cannot see.
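A sketch of the routing function behind that edge; the threshold store and log line are ours, not a framework's:

```python
import logging

logger = logging.getLogger("invoice-agent")

# Loaded from the same config store the dashboard reads, one threshold per node.
THRESHOLDS = {"classify": 0.85, "extract": 0.90}

def after_classify(state: dict) -> str:
    confidence = state["classified"]["confidence"]
    threshold = THRESHOLDS["classify"]
    # Log both values so the trace shows why this case went where it did.
    logger.info("classify confidence=%.2f threshold=%.2f", confidence, threshold)
    return "extract" if confidence >= threshold else "human_review"

# Wired in as a conditional edge:
# builder.add_conditional_edges("classify", after_classify, ["extract", "human_review"])
```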
Deterministic guardrail nodes
Every node where an LLM made a decision is followed by a non-LLM check. Schema validation, business-rule check, duplicate check, numeric reconciliation. The guardrail node is allowed to override the LLM and route to human review. If the only check on the LLM is another LLM, the system has no floor.
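A sketch of the validate node from the invoice topology - no model call anywhere in it. The specific checks and tolerances are illustrative; the structure is the pattern:

```python
from decimal import Decimal

def is_duplicate(invoice: dict) -> bool:
    # Stand-in for a lookup against the last 90 days of postings.
    return False

def validate(state: dict) -> dict:
    invoice = state["extracted"]
    errors = []
    line_sum = sum(Decimal(str(li["amount"])) for li in invoice.get("line_items", []))
    if abs(line_sum - Decimal(str(invoice["total"]))) > Decimal("0.01"):
        errors.append(f"line items sum to {line_sum}, total says {invoice['total']}")
    if invoice.get("vat_rate") not in (0, 5, 20):  # plausible rates are jurisdiction-specific
        errors.append(f"implausible VAT rate: {invoice.get('vat_rate')}")
    if is_duplicate(invoice):
        errors.append("possible duplicate of a recent invoice")
    # Fail closed: any entry here routes the case to human_review with the
    # offending field named in the trace.
    return {"validation_errors": errors}
```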
Resume from the interrupt point
When human review resolves a case, the graph resumes from the paused node and skips the work that already ran. LangGraph's Command(resume=...) is the canonical pattern; the principle predates the framework. Anthropic reports the same: they built systems that "can resume from where the agent was when the errors occurred" rather than restarting, because restart-from-zero burns tokens and frustrates reviewers.
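A sketch using LangGraph's interrupt and Command primitives; the shape of the review payload is ours, not the framework's:

```python
from langgraph.types import Command, interrupt

def human_review(state: dict) -> dict:
    # Pauses the graph, checkpoints state, and surfaces this payload to the
    # review queue. Execution stops here until a reviewer answers.
    decision = interrupt({
        "invoice": state.get("extracted"),
        "errors": state.get("validation_errors"),
    })
    return {"routing_decision": decision}

# Later, when the reviewer resolves the case: the graph resumes from this
# node, skipping every node that already ran.
config = {"configurable": {"thread_id": "inv-001"}}
graph.invoke(Command(resume="auto_post"), config)
```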
A hard step budget
Maximum iterations per run, maximum total tool calls, maximum wall-clock seconds, maximum cost. Hitting any ceiling routes to human review with the partial state preserved. Without ceilings, an agent that misclassifies one document can quietly burn a four-figure token bill before anyone notices. For back-office work the ceilings are tighter than for code-writing agents because the unit cost per processed item is small and the volume is high. We covered the underlying argument on budgets in the sibling post on the agent control surface and the eval shapes most teams skip.
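A sketch of the ceilings in practice. recursion_limit is LangGraph's built-in step ceiling; the wall-clock and cost checks are our own additions, with illustrative names:

```python
import time
from langgraph.errors import GraphRecursionError

CEILINGS = {"max_steps": 25, "max_seconds": 120, "max_cost_usd": 0.50}

def escalate(thread_id: str, reason: str) -> dict:
    # Stand-in: push the case to the review queue with partial state attached.
    return {"routing_decision": "hold_for_human", "reason": reason}

def run_with_budget(graph, inputs: dict, thread_id: str) -> dict:
    config = {"configurable": {"thread_id": thread_id},
              "recursion_limit": CEILINGS["max_steps"]}
    start = time.monotonic()
    try:
        # Stream node by node so the ceilings are checked between steps;
        # the checkpointer preserves partial state if we stop early.
        for _ in graph.stream(inputs, config):
            if time.monotonic() - start > CEILINGS["max_seconds"]:
                return escalate(thread_id, "wall-clock ceiling hit")
            if graph.get_state(config).values.get("cost_usd", 0) > CEILINGS["max_cost_usd"]:
                return escalate(thread_id, "cost ceiling hit")
    except GraphRecursionError:
        return escalate(thread_id, "step ceiling hit")
    return graph.get_state(config).values
```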
What to demand from a vendor pitch
If you are evaluating a vendor or your own team's design, the diagnostic is short.
Ask to see the topology. The actual graph with named nodes and edges. If the answer is "an agent with tools", the system has no topology yet.
Ask which nodes fail closed. If every node has the same failure mode (escalate to a generic queue), you have a chat with a backlog rather than an operations system.
Ask what the human-review interface looks like. Who reviews, what they see, how the resolution writes back. A back-office agent without a review interface in week one will get one in week three, built badly.
Ask which observability layer is shipped. Spans, run-state, operator dashboard. "On the roadmap" means no observability.
Ask what happens during an LLM provider outage. Most agent products fail open during a provider incident; an invoice queue that silently fails open for four hours is the same audit finding as a missing approval. The acceptable answer is the agent stops processing and the queue holds.
We run this conversation as a paid AI operations audit for operators who have already been pitched, or who have an agent live and are not sure whether to defend it or pull it. The agent is the interesting question. The topology, the observability, and the fallback path are the answer. If you have an agent in scope and these are not nailed down, the first conversation is free.