Most GenAI systems do not fail because the model is "bad." They fail because real production traffic is messy. A provider slows down, one prompt edit changes output shape, a tool returns partial data, and suddenly your UI is showing answers that look confident but are wrong.
After seeing this repeatedly, I stopped optimizing for demo quality. I started optimizing for failure handling.
What changed in my approach
I treat every model call like an API boundary I do not control.
- Output must match a strict schema.
- Invalid output must fail fast.
- Every failure path must be explicit.
If output parsing fails, that is not a minor warning. That is a failed request.
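A minimal sketch of that contract, using only the standard library. The field names ("answer", "confidence", "sources") are assumptions for illustration, not a real schema; in production you would likely use a validation library instead of hand-rolled checks.

```python
import json

# Hypothetical output contract for a model answer.
# Field names and types here are illustrative assumptions.
REQUIRED_FIELDS = {"answer": str, "confidence": float, "sources": list}

class SchemaError(ValueError):
    """Raised when model output does not match the contract."""

def parse_model_output(raw: str) -> dict:
    """Parse raw model output and fail fast on any schema mismatch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not valid JSON: {exc}") from exc
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise SchemaError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise SchemaError(f"wrong type for field: {field}")
    return data
```

The point is the failure mode: a parse error is an exception that stops the request, not a warning that lets a malformed answer reach the UI.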
Tool calls are the real reliability problem
Once tools enter the loop, the system becomes much harder to reason about. Most serious incidents I have seen came from tool behavior, not from prompt wording.

My baseline rules are simple:
- Every tool has typed input and typed output.
- Timeouts and bounded retries are mandatory.
- Partial data is labeled as unknown, not interpreted as truth.
- The model cannot silently continue after tool failure.
One rule alone eliminates a whole class of invisible bugs: unknown is never treated as known.
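The rules above can be sketched as a thin wrapper around every tool call. The `ToolResult` shape and the retry/backoff numbers are assumptions for illustration; the key property is that failure is an explicit value the caller must handle, never silently swallowed.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ToolResult:
    """Explicit outcome of a tool call: missing data stays None, labeled unknown."""
    ok: bool
    value: Optional[dict] = None
    error: Optional[str] = None

def call_tool(fn: Callable[[], dict], retries: int = 2, backoff: float = 0.1) -> ToolResult:
    """Run a tool with bounded retries; on exhaustion, surface the failure."""
    last_error = "unknown"
    for attempt in range(retries + 1):
        try:
            return ToolResult(ok=True, value=fn())
        except Exception as exc:  # timeouts must raise, never hang
            last_error = str(exc)
            time.sleep(backoff * (attempt + 1))
    return ToolResult(ok=False, error=last_error)
```

Because the model layer only ever sees a `ToolResult`, it cannot silently continue after a failure: `ok=False` has to be routed somewhere deliberate.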
Three runtime paths I actually use
For each request, I force one of these paths:
- Pass
  - Schema is valid
  - Evidence is sufficient
  - Tool results are complete enough
- Retry
  - Narrow context
  - Tighten constraints
  - Retry once or twice, then stop
- Fallback
  - Switch to a safer model or simpler path
  - Ask a clarification question when needed
  - Escalate to human review for high-risk cases
The goal is not to avoid failure. The goal is to fail in a controlled way.
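The routing itself can be a small pure function. The flag names and retry limit below are illustrative assumptions; what matters is that every request is forced onto exactly one of the three paths.

```python
from enum import Enum

class Path(str, Enum):
    PASS = "pass"
    RETRY = "retry"
    FALLBACK = "fallback"

def decide_path(schema_valid: bool, evidence_ok: bool, tools_complete: bool,
                attempts: int, max_retries: int = 2) -> Path:
    """Map one request's state to exactly one runtime path."""
    if schema_valid and evidence_ok and tools_complete:
        return Path.PASS
    if attempts < max_retries:
        return Path.RETRY      # narrow context, tighten constraints, try again
    return Path.FALLBACK       # safer model, clarification, or human review
```

Keeping this decision in one place means the "controlled failure" behavior is testable on its own, independent of any model or tool.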
Traceability is mandatory
When something goes wrong, I need to answer "why" quickly. So every user-visible answer is tied to:
- prompt version
- model version
- retrieved context IDs
- tool outputs
- validation result
- final decision path
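The fields above fit naturally into one trace record attached to every answer. The field names mirror the list and are illustrative, not a fixed schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class AnswerTrace:
    """Everything needed to explain one user-visible answer after the fact."""
    prompt_version: str
    model_version: str
    context_ids: list
    tool_outputs: dict
    validation_result: str
    decision_path: str

    def to_log_line(self) -> str:
        """Serialize as one structured log line for later lookup."""
        return json.dumps(asdict(self), sort_keys=True)
```

One JSON line per answer is enough: when an incident comes in, a single log query recovers the prompt version, context, and decision path that produced the output.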
If I cannot explain an answer in a few minutes, I do not trust the system.
My release checklist before shipping
Before changing prompts, models, or tool behavior:
- Run offline evals on known hard cases.
- Run shadow traffic with no user exposure.
- Roll out as a small canary.
- Keep kill switches ready per model and tool.
- Keep rollback immediate and versioned.
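A kill switch can be this simple. In practice the flags would live in a config store so they can be flipped without a deploy; the in-memory dict and component names here are assumptions for illustration.

```python
# Per-component kill switches, keyed by "kind:name" (names are hypothetical).
KILL_SWITCHES = {"model:fast-model": False, "tool:web_search": False}

def is_enabled(component: str) -> bool:
    """A component is enabled unless its kill switch has been flipped."""
    return not KILL_SWITCHES.get(component, False)

def trip(component: str) -> None:
    """Flip the kill switch for one model or tool, no redeploy needed."""
    KILL_SWITCHES[component] = True
```

The check sits at the call site: if `is_enabled` returns false for a tool, the request routes straight to the fallback path instead of invoking it.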
Nothing here is flashy, but this is what keeps systems stable while still shipping fast.
Final note
Reliability is part of product design, not a cleanup step after launch. If your system has clear contracts, strong tool discipline, and traceable decisions, you can move fast without burning user trust.