Long-running agents can look great in demos and still fail in production. I learned this the hard way while building workflows with many tool calls. The common failures were almost always the same: unclear goals, weak tool contracts, poor memory handling, and no budget limits.
The fix is to treat agents as systems, not scripts.
Plan first, execute second
I never let an agent start by calling tools without a plan.
Each run needs:
- explicit steps
- success condition per step
- retry cap per step
- global limits for time, cost, and tool calls
- clear stop conditions
If "done" is not clearly defined, the agent keeps looping and quality drops.
Tool contracts decide system quality
Tooling is where agent reliability is won or lost.
For every tool I enforce:
- typed input
- typed output
- timeout and bounded retries
- explicit error shape
- idempotent or reversible behavior
The most severe bugs happen when tool errors are silently converted into fluent text. I keep errors explicit all the way through the pipeline.
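A contract like this can be made concrete with typed dataclasses and an explicit error value. This is a hedged sketch: `SearchInput`, `SearchOutput`, and `ToolError` are illustrative names, not part of any real library.

```python
# Illustrative tool contract: typed input/output, an explicit error
# shape, and bounded retries. Hypothetical names throughout.
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class SearchInput:
    query: str

@dataclass
class SearchOutput:
    results: list

@dataclass
class ToolError:              # explicit error shape, never free-form prose
    code: str
    message: str
    retryable: bool

def call_with_retries(
    tool: Callable[[SearchInput], Union[SearchOutput, ToolError]],
    inp: SearchInput,
    max_retries: int = 2,
) -> Union[SearchOutput, ToolError]:
    last_error = None
    for _ in range(max_retries + 1):   # bounded retries
        out = tool(inp)
        if isinstance(out, ToolError) and out.retryable:
            last_error = out           # retry only retryable errors
            continue
        return out
    return last_error                  # surface the error, don't paper over it
```

Because the error is a value with a `code`, the caller can branch on it instead of parsing apologetic prose.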
Runtime lanes for control
I use four decision lanes:
- Pass: step succeeded with strong evidence
- Retry: limited retry with narrowed context
- Fallback: safer strategy or simpler path
- Escalate: hand off to a human or ask the user for clarification
Without explicit lanes, agents make those decisions implicitly and inconsistently.
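One way to make the lanes explicit is a small dispatch function. The thresholds and signature here are assumptions for illustration; the real inputs would come from your evaluation step.

```python
# Sketch of explicit decision lanes. The 0.8 evidence threshold is a
# placeholder assumption, not a recommendation.
from enum import Enum

class Lane(Enum):
    PASS = "pass"
    RETRY = "retry"
    FALLBACK = "fallback"
    ESCALATE = "escalate"

def choose_lane(evidence_score: float, attempts: int,
                max_retries: int, has_fallback: bool) -> Lane:
    if evidence_score >= 0.8:          # step succeeded with strong evidence
        return Lane.PASS
    if attempts < max_retries:         # limited retry with narrowed context
        return Lane.RETRY
    if has_fallback:                   # safer strategy or simpler path
        return Lane.FALLBACK
    return Lane.ESCALATE               # hand off or ask for clarification
```

Centralizing this logic means the same evidence always leads to the same lane, which is exactly the consistency implicit decisions lack.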
Memory model that scales
Transcript stuffing breaks quickly. Instead, I keep:
- an append-only event log
- a compact working state
- references to evidence IDs
After key checkpoints, I summarize state in structured form. This keeps token growth under control and reduces drift across long runs.
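The three pieces above can be held in one small structure. This is a minimal sketch assuming in-process storage; a production version would persist the log and evidence elsewhere.

```python
# Sketch: append-only event log, compact working state, and evidence
# referenced by ID rather than inlined. Illustrative, in-memory only.
import json

class AgentMemory:
    def __init__(self):
        self.events = []     # append-only event log
        self.state = {}      # compact working state sent to the model
        self.evidence = {}   # evidence_id -> raw payload, fetched on demand

    def record(self, kind: str, payload: object) -> str:
        eid = f"ev-{len(self.events)}"
        self.events.append({"id": eid, "kind": kind})
        self.evidence[eid] = payload   # raw data stays out of the prompt
        return eid                     # only the reference travels

    def checkpoint(self, summary: dict) -> str:
        # Structured summary replaces the working state; the event log
        # is untouched, so nothing is lost, only compacted.
        self.state = summary
        return json.dumps(summary)
```

Prompt size now tracks the size of `state`, not the length of the run.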
Safety means enforceable controls
I do not treat "agent confidence" as evidence.
A claim is trusted only if it can be traced to:
- verified tool output
- trusted system record
- attributed retrieved evidence
If verification fails, the agent must either ask for clarification or escalate.
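The trust rule reduces to a provenance check. A hedged sketch, assuming each claim carries a `source` and an `evidence_id` field:

```python
# Sketch: a claim is trusted only with traceable provenance.
# Field names are assumptions for illustration.
TRUSTED_SOURCES = {"tool_output", "system_record", "retrieved_evidence"}

def verify_claim(claim: dict) -> str:
    # Model confidence alone never counts as evidence.
    source = claim.get("source")
    if source in TRUSTED_SOURCES and claim.get("evidence_id"):
        return "trusted"
    return "escalate"   # or ask the user for clarification
```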
Metrics I actually track
Answer quality alone is not enough for agents. I track:
- task completion rate
- calls per completed task
- fallback rate
- escalation rate
- budget compliance
- incident type breakdown
These metrics show whether changes improve real outcomes or only make traces look cleaner.
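A minimal tracker for these metrics might look like this. The shape is an assumption; the point is that every run feeds the same counters.

```python
# Sketch of per-run metrics aggregation. Illustrative structure only.
from collections import Counter

class RunMetrics:
    def __init__(self):
        self.tasks = 0
        self.completed = 0
        self.tool_calls = 0
        self.lanes = Counter()         # incident/lane breakdown
        self.budget_violations = 0

    def record_run(self, completed: bool, tool_calls: int,
                   lane: str, within_budget: bool) -> None:
        self.tasks += 1
        self.completed += int(completed)
        self.tool_calls += tool_calls
        self.lanes[lane] += 1
        self.budget_violations += 0 if within_budget else 1

    def report(self) -> dict:
        tasks = max(self.tasks, 1)
        return {
            "completion_rate": self.completed / tasks,
            "calls_per_completed_task": self.tool_calls / max(self.completed, 1),
            "fallback_rate": self.lanes["fallback"] / tasks,
            "escalation_rate": self.lanes["escalate"] / tasks,
            "budget_compliance": 1 - self.budget_violations / tasks,
        }
```

Comparing `report()` before and after a change tells you whether outcomes improved, not just whether traces read better.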
Final note
Agents scale when the system around them is disciplined. Clear plans, strict tool contracts, controlled memory, and measurable runtime behavior matter far more than longer prompts.