Here's something that surprised me early on: most GenAI systems don't fail because the model is bad. They fail because real production traffic is messy. A provider slows down, one prompt edit shifts the output shape, a tool returns partial data, and suddenly your UI is showing answers that look confident but are completely wrong.
After hitting this wall a few times, I stopped optimizing for demo quality. That's the wrong target. I started optimizing for what happens when things go sideways.
What actually changed in how I build
I treat every model call the way I'd treat an external API I don't own or control.
- Output has to match a strict schema.
- If it doesn't, fail fast. Don't guess.
- Every failure path needs to be explicit, not tucked away in a catch block.
If output parsing fails, that's not a minor warning you log and move on from. That's a failed request. Full stop.
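Here's a minimal sketch of what "strict schema, fail fast" can look like, using only the Python standard library. The `Answer` shape, its three fields, and the `OutputValidationError` name are assumptions made up for illustration. Your own contract will differ, and a schema library would do the same job.

```python
import json
from dataclasses import dataclass


class OutputValidationError(Exception):
    """Raised when model output does not match the expected schema."""


@dataclass(frozen=True)
class Answer:
    # Hypothetical output contract, for illustration only.
    text: str
    confidence: float
    sources: list


# Expected keys and their allowed JSON types.
_EXPECTED = {"text": str, "confidence": (int, float), "sources": list}


def parse_answer(raw: str) -> Answer:
    """Parse raw model output, or raise. Never guess at a repair."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputValidationError(f"not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise OutputValidationError("top-level value must be an object")

    # Reject both missing and unexpected keys.
    wrong_keys = set(data) ^ set(_EXPECTED)
    if wrong_keys:
        raise OutputValidationError(f"missing or extra keys: {sorted(wrong_keys)}")

    for key, allowed in _EXPECTED.items():
        value = data[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, allowed):
            raise OutputValidationError(f"wrong type for {key!r}")

    if not 0.0 <= data["confidence"] <= 1.0:
        raise OutputValidationError("confidence must be between 0 and 1")

    return Answer(data["text"], float(data["confidence"]), data["sources"])
```

The caller gets either a valid `Answer` or an exception it has to handle, so a malformed response can't reach the UI as a half-parsed object.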
Tool calls are where reliability actually breaks down
Once tools enter the picture, everything gets harder. The serious incidents I've seen in production almost never came from prompt wording. They came from tool behavior that nobody fully thought through.
My baseline rules for every tool:
- Typed input, typed output. No loose objects.
- Timeouts and bounded retries are non-negotiable.
- Partial data gets labeled as unknown, not interpreted as truth.
- The model can't silently continue after a tool failure.
That last rule alone removes a surprising number of invisible bugs. Combined with the partial-data rule, it means unknown never gets treated as known.
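The four rules above can be sketched as a single wrapper around any tool function. This is one possible shape, not a definitive implementation: `ToolResult`, `ToolStatus`, `call_tool`, `require_usable`, and the default timeout and attempt counts are all hypothetical names and values. The timeout uses a worker thread from `concurrent.futures`, which stops waiting but cannot cancel the underlying call, so a real tool should also set its own network timeout.

```python
import concurrent.futures
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence


class ToolStatus(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"  # some required fields are unknown
    FAILED = "failed"


@dataclass(frozen=True)
class ToolResult:
    status: ToolStatus
    data: Optional[dict] = None
    missing_fields: tuple = ()  # required fields to treat as unknown
    error: Optional[str] = None


class ToolFailure(Exception):
    """Raised so the caller cannot silently continue after a failed tool."""


def call_tool(fn: Callable[[], dict], required_fields: Sequence[str],
              timeout_s: float = 2.0, max_attempts: int = 3) -> ToolResult:
    """Run fn with a per-attempt timeout and a hard cap on attempts."""
    last_error = "no attempts made"
    for attempt in range(1, max_attempts + 1):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        try:
            data = pool.submit(fn).result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            last_error = f"attempt {attempt}: timed out after {timeout_s}s"
            continue
        except Exception as exc:
            last_error = f"attempt {attempt}: {exc!r}"
            continue
        finally:
            # Do not block on a hung worker; it is abandoned, not cancelled.
            pool.shutdown(wait=False)

        if not isinstance(data, dict):
            last_error = f"attempt {attempt}: expected dict, got {type(data).__name__}"
            continue

        # Missing or null required fields are labeled, never filled in.
        missing = tuple(f for f in required_fields if data.get(f) is None)
        status = ToolStatus.PARTIAL if missing else ToolStatus.COMPLETE
        return ToolResult(status=status, data=data, missing_fields=missing)

    return ToolResult(status=ToolStatus.FAILED, error=last_error)


def require_usable(result: ToolResult) -> ToolResult:
    """Stop the pipeline on failure instead of letting the model continue."""
    if result.status is ToolStatus.FAILED:
        raise ToolFailure(result.error or "tool failed")
    return result
```

A partial result still comes back, but it carries `missing_fields`, so whatever builds the model's context can mark those values as unknown instead of passing them off as facts.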
Three runtime paths I actually route through
For every request, I force one of these three paths explicitly:
- Pass when the schema is valid, evidence is sufficient, and tool results are complete enough to act on.
- Retry when context is too narrow or constraints are too loose. Retry once or twice with tighter parameters, then stop. No infinite loops.
- Fallback when retrying isn't enough. Switch to a safer model, ask a clarifying question, or escalate to a human for anything high-stakes.
The goal isn't to avoid failure. It's to fail in a way you planned for.
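The three paths can be sketched as one small decision function plus a bounded loop. Everything named here (`Path`, `Check`, `decide`, `run`, and the default of two retries) is a hypothetical illustration. The `attempt` callback stands in for whatever actually reruns the request with tighter parameters.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Path(Enum):
    PASS = "pass"
    RETRY = "retry"
    FALLBACK = "fallback"


@dataclass(frozen=True)
class Check:
    schema_valid: bool
    evidence_sufficient: bool
    tools_complete: bool

    @property
    def ok(self) -> bool:
        return self.schema_valid and self.evidence_sufficient and self.tools_complete


def decide(check: Check, retries_used: int, max_retries: int = 2) -> Path:
    """Every request gets exactly one explicit path per attempt."""
    if check.ok:
        return Path.PASS
    if retries_used < max_retries:
        return Path.RETRY
    return Path.FALLBACK


def run(attempt: Callable[[int], Check], max_retries: int = 2) -> List[Path]:
    """Call attempt(n) with n = retries so far; return the paths taken.

    The loop is bounded: decide() can return RETRY at most max_retries times.
    """
    taken: List[Path] = []
    retries_used = 0
    while True:
        path = decide(attempt(retries_used), retries_used, max_retries)
        taken.append(path)
        if path is not Path.RETRY:
            return taken
        retries_used += 1
```

The returned list always ends in either `Path.PASS` or `Path.FALLBACK`, which is the decision you'd log and act on.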
You can't trust a system you can't trace
When something goes wrong, I need to figure out why in a few minutes, not a few hours. So every user-visible answer in my systems gets tied to:
- which prompt version ran
- which model version ran
- which context chunks were retrieved
- what the tools actually returned
- what the validation result was
- which decision path was taken
If I can't reconstruct an answer quickly, I don't trust the system. And neither should you.
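One way to make that list concrete is a single trace record written alongside every answer. The sketch below assumes a `request_id` as the lookup key and serializes to one JSON log line. The field names mirror the list above, but `AnswerTrace` and `to_log_line` are hypothetical, and where the line gets stored is up to you.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AnswerTrace:
    request_id: str           # assumed lookup key, not from any specific system
    prompt_version: str
    model_version: str
    context_chunk_ids: tuple  # IDs of the retrieved chunks
    tool_results: dict        # raw tool outputs, keyed by tool name
    validation_result: str    # e.g. "valid" or the validation error message
    decision_path: str        # e.g. "pass", "retry", "fallback"


def to_log_line(trace: AnswerTrace) -> str:
    """Serialize the trace as one JSON line with stable key order."""
    return json.dumps(asdict(trace), sort_keys=True)


trace = AnswerTrace(
    request_id="req-001",
    prompt_version="prompt-v12",
    model_version="model-2024-05",
    context_chunk_ids=("chunk-7", "chunk-19"),
    tool_results={"inventory": {"status": "partial", "missing": ["eta"]}},
    validation_result="valid",
    decision_path="pass",
)
line = to_log_line(trace)
```

Writing the record at answer time, not reconstructing it later from scattered logs, is what keeps the "few minutes" target realistic.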
What I check before shipping any change
Before touching prompts, models, or tool behavior:
- Run offline evals on the cases I know are hard.
- Run shadow traffic with zero user exposure.
- Roll out to a small canary slice first.
- Keep kill switches ready at the model and tool level.
- Make rollback immediate and versioned.
None of this is glamorous. It's just what keeps systems stable while you're still shipping at a real pace.
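The canary and kill-switch items are the easiest to show in code. Here's a rough sketch under some assumptions: users are bucketed deterministically by hashing an ID, and kill switches are a set of disabled names checked on every request. `ReleaseConfig`, `in_canary`, `pick_model`, and `tool_enabled` are made-up names, and in practice the config would come from a flag service or config store, not a literal.

```python
import hashlib
from dataclasses import dataclass


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets.

    The same user always lands in the same bucket, so their experience
    does not flip between versions from one request to the next.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


@dataclass(frozen=True)
class ReleaseConfig:
    stable_model: str
    canary_model: str
    canary_percent: int = 0         # 0 disables the canary entirely
    killed: frozenset = frozenset() # names of disabled models and tools


def pick_model(cfg: ReleaseConfig, user_id: str) -> str:
    """Route to the canary only if it is not killed and the user is in the slice."""
    if cfg.canary_model not in cfg.killed and in_canary(user_id, cfg.canary_percent):
        return cfg.canary_model
    return cfg.stable_model


def tool_enabled(cfg: ReleaseConfig, tool_name: str) -> bool:
    """Tool-level kill switch: a killed tool is never called."""
    return tool_name not in cfg.killed
```

Rollback then amounts to a config change: set `canary_percent` to 0 or add the canary's name to `killed`, and the next request goes back to the stable version.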
The thing most people miss
Reliability isn't something you bolt on after the demo goes well. It's part of product design from day one. Clear contracts, strict tool discipline, and traceable decisions aren't slowdowns. They're what lets you move fast without burning user trust every other week.