I've built long-running agents that looked great in demos and fell apart in production. And the failures were almost never exotic. They were the same things every time: unclear goals, weak tool contracts, memory that grew without bound, and no defined budget for how many steps or calls were even allowed.
The honest fix is boring. You have to stop thinking of agents as smart scripts and start treating them as systems.
Plan first, then execute
I never let an agent reach for a tool without a plan in place first.
Every run needs to define:
- the explicit steps in order
- a success condition for each step
- a retry cap per step, not just globally
- hard limits on time, cost, and total tool calls
- clear stop conditions that aren't just "when it's done"
If "done" isn't precisely defined going in, the agent has no reason to stop and will keep looping. In my experience, output quality drops with every extra loop.
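Here is a minimal sketch of what such a plan could look like in Python. The names (`Step`, `Budget`, `run_plan`) and the flat `cost_per_call` are illustrative assumptions, not part of any particular framework. The point is only that the success conditions, per-step retry caps, and hard limits exist as data before the first tool call happens.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]


@dataclass
class Step:
    name: str
    action: Callable[[State], State]    # performs the step, returns new state
    succeeded: Callable[[State], bool]  # explicit success condition
    max_retries: int = 2                # per-step retry cap


@dataclass
class Budget:
    max_tool_calls: int
    max_seconds: float
    max_cost: float
    cost_per_call: float = 0.01         # hypothetical flat cost per call


def run_plan(steps: List[Step], budget: Budget) -> Tuple[str, State]:
    """Run steps in order and stop as soon as any limit is reached."""
    state: State = {}
    calls = 0
    started = time.monotonic()
    for step in steps:
        attempts = 0
        while True:
            # Hard limits are checked before every call, not only at the end.
            if calls >= budget.max_tool_calls:
                return "stopped:tool_calls", state
            if time.monotonic() - started > budget.max_seconds:
                return "stopped:time", state
            if (calls + 1) * budget.cost_per_call > budget.max_cost:
                return "stopped:cost", state
            state = step.action(state)
            calls += 1
            attempts += 1
            if step.succeeded(state):
                break
            if attempts > step.max_retries:
                return f"stopped:retries:{step.name}", state
    return "done", state


steps = [
    Step("fetch", lambda s: {**s, "rows": 3}, lambda s: s.get("rows", 0) > 0),
    Step("flaky", lambda s: s, lambda s: False, max_retries=1),
]
status, state = run_plan(
    steps, Budget(max_tool_calls=10, max_seconds=5.0, max_cost=1.0)
)
print(status)  # stopped:retries:flaky
```

Every exit path returns a named stop reason, so "why did this run end?" is answerable from the return value alone.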
Tool contracts are where reliability gets decided
This is the thing I'd tell any team building agentic systems. The reliability of your agent is mostly determined by the quality of your tool contracts, not by your prompts.
For every tool I build or integrate:
- typed input, typed output, no exceptions
- a defined timeout and bounded retries, never open-ended
- explicit error shape, not just a caught exception
- idempotent or reversible behavior where the task allows it
The most severe bugs I've seen in production happened when tool errors got silently converted into fluent model text and passed along as if nothing went wrong. I keep errors explicit all the way through.
Four lanes, not one
When I'm designing the runtime logic, I think in four explicit decision lanes:
- Pass: the step succeeded with strong, traceable evidence.
- Retry: retry with narrowed context, limited to a fixed count.
- Fallback: switch to a safer strategy or a simpler path through the task.
- Escalate: hand off to a human or ask the user for clarification.
Without lanes like these spelled out ahead of time, the agent makes those calls implicitly. And implicitly made decisions are the ones that create incidents at 2am.
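To make the lanes concrete, here is one way the decision could be written down as code. The `StepOutcome` fields and the ordering of the checks are my assumptions for this sketch. Your own rules will differ, but the decision should live in one explicit function rather than inside a prompt.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    PASS = "pass"
    RETRY = "retry"
    FALLBACK = "fallback"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class StepOutcome:
    succeeded: bool
    has_evidence: bool        # result traces back to a tool output or record
    retryable: bool
    attempts: int             # attempts made so far, including this one
    fallback_available: bool


def choose_lane(outcome: StepOutcome, max_attempts: int = 3) -> Lane:
    """Map a step outcome to exactly one lane, checked in a fixed order."""
    if outcome.succeeded and outcome.has_evidence:
        return Lane.PASS
    if outcome.retryable and outcome.attempts < max_attempts:
        return Lane.RETRY
    if outcome.fallback_available:
        return Lane.FALLBACK
    return Lane.ESCALATE


first_failure = StepOutcome(
    succeeded=False,
    has_evidence=False,
    retryable=True,
    attempts=1,
    fallback_available=True,
)
print(choose_lane(first_failure))  # Lane.RETRY
```

Note that a step which "succeeded" without traceable evidence does not pass in this sketch. It falls through to the other lanes.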
Memory that doesn't blow up at scale
Stuffing the full transcript into every context window breaks fast. Once you're past a few hundred steps, the noise drowns out the signal.
What actually works for me: an append-only event log, a compact working state, and references to evidence IDs rather than the evidence itself. After key checkpoints, I summarize current state in a structured format. This keeps each step's context roughly bounded, so total token usage grows linearly with run length instead of quadratically, and it noticeably reduces drift across long runs.
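A small sketch of that memory layout in Python. `AgentMemory` and its method names are hypothetical, and a real system would persist the log and evidence somewhere durable rather than in process memory.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentMemory:
    events: List[dict] = field(default_factory=list)        # append-only log
    evidence: Dict[str, str] = field(default_factory=dict)  # full payloads
    state: Dict[str, object] = field(default_factory=dict)  # working state

    def record(self, step: str, payload: str) -> str:
        """Store the payload out of band and log only a reference to it."""
        evidence_id = f"ev-{len(self.evidence) + 1}"
        self.evidence[evidence_id] = payload
        self.events.append({"step": step, "evidence_id": evidence_id})
        return evidence_id

    def checkpoint(self, summary: Dict[str, object]) -> None:
        """Replace the working state with a structured summary."""
        self.state = {**summary, "as_of_event": len(self.events)}

    def context(self, recent: int = 3) -> str:
        """Build what goes into the prompt: state plus a short tail."""
        return json.dumps(
            {"state": self.state, "recent_events": self.events[-recent:]},
            sort_keys=True,
        )


mem = AgentMemory()
for i in range(200):
    mem.record(f"step-{i}", "x" * 10_000)  # large tool outputs
mem.checkpoint({"goal": "demo", "open_items": 2})
print(len(mem.events), len(mem.context()))
```

After 200 steps with large payloads, the context string stays a few hundred characters long. The payloads are still retrievable by ID when a step actually needs them.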
"Confidence" is not a verification mechanism
I've learned to be skeptical of any system that trusts what the model says it found. Agent confidence is not evidence.
A claim only gets treated as trusted in my systems if it can be traced back to a verified tool output, a trusted system record, or attributed retrieved evidence. If verification fails, the agent either asks for clarification or escalates. It doesn't guess and move on.
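As a sketch, the trust rule can be a few lines of code. The `Source` categories mirror the three I listed above, and everything else (`Claim`, `verify`, the `user_available` flag) is an assumption for illustration. The model-reported confidence is carried along but deliberately never consulted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Source(Enum):
    TOOL_OUTPUT = "tool_output"
    SYSTEM_RECORD = "system_record"
    RETRIEVED = "retrieved"
    MODEL_ASSERTION = "model_assertion"


TRUSTED_SOURCES = {Source.TOOL_OUTPUT, Source.SYSTEM_RECORD, Source.RETRIEVED}


@dataclass(frozen=True)
class Claim:
    text: str
    source: Source
    evidence_id: Optional[str] = None
    confidence: float = 0.0  # model-reported; ignored by verify()


def verify(
    claim: Claim, known_evidence_ids: Set[str], user_available: bool = True
) -> str:
    """Trust a claim only if it traces to known evidence of a trusted kind."""
    traceable = (
        claim.source in TRUSTED_SOURCES
        and claim.evidence_id in known_evidence_ids
    )
    if traceable:
        return "trusted"
    # No guessing: ask for clarification if possible, otherwise escalate.
    return "ask_user" if user_available else "escalate"


known = {"ev-1", "ev-2"}
confident_guess = Claim(
    "Invoice was paid", Source.MODEL_ASSERTION, confidence=0.99
)
print(verify(confident_guess, known))  # ask_user
```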
What I actually measure
Measuring output quality alone will mislead you. For agentic systems specifically, I track:
- task completion rate
- average tool calls per completed task
- fallback rate (are we hitting edge cases more than expected?)
- escalation rate
- budget compliance (did runs stay within defined limits?)
- incident type breakdown over time
These numbers tell you whether your changes improved real outcomes or just made the traces look a little cleaner.
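For illustration, here is how these could be computed from per-run records. `RunRecord` and `summarize` are hypothetical names, and the sketch groups incidents by type only. Bucketing them by time window is left out to keep it short.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RunRecord:
    completed: bool
    tool_calls: int
    used_fallback: bool
    escalated: bool
    within_budget: bool
    incident_type: Optional[str] = None


def summarize(runs: List[RunRecord]) -> Dict[str, object]:
    """Aggregate per-run records into the metrics listed above."""
    if not runs:
        raise ValueError("need at least one run")
    total = len(runs)
    completed = [r for r in runs if r.completed]
    avg_calls = (
        sum(r.tool_calls for r in completed) / len(completed)
        if completed
        else None
    )
    return {
        "task_completion_rate": len(completed) / total,
        "avg_tool_calls_per_completed_task": avg_calls,
        "fallback_rate": sum(r.used_fallback for r in runs) / total,
        "escalation_rate": sum(r.escalated for r in runs) / total,
        "budget_compliance": sum(r.within_budget for r in runs) / total,
        "incidents_by_type": dict(
            Counter(r.incident_type for r in runs if r.incident_type)
        ),
    }


runs = [
    RunRecord(True, 4, False, False, True),
    RunRecord(True, 6, True, False, True),
    RunRecord(False, 12, True, True, False, "budget_exceeded"),
    RunRecord(True, 5, False, False, True),
]
report = summarize(runs)
print(report)
```

Average tool calls are computed over completed tasks only, so a run that burned its budget and failed shows up in budget compliance rather than skewing the efficiency number.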
What actually makes agents scale
More tokens in the prompt is almost never the answer. What scales is discipline in the system around the agent: clear upfront plans, strict tool contracts, controlled memory, and runtime behavior you can actually measure. Get those four things right and you can push agents a lot further than most teams think is possible.