Context engineering playbook I use before touching prompts

The practical method I follow to shape inputs, retrieval, and constraints so outputs stay stable.

Published February 14, 2026 · 7 min read

Almost every prompt quality problem I've run into is actually a context quality problem in disguise.

When a model output feels weak or unpredictable, my first instinct isn't to rewrite the prompt. It's to look at what the model is actually seeing. In my experience, output quality is mostly decided before the model even starts generating. It comes down to what you send, how you organize it, and how clearly you signal what should be trusted.

Context is an interface. Treat it like one.

How I structure what the model sees

I split the context payload into distinct blocks, always in this order:

  1. instructions
  2. hard constraints
  3. user state
  4. retrieved evidence
  5. tool outputs

This makes debugging way more practical. If something's wrong, I can isolate which block is the problem instead of staring at one giant prompt and guessing.
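Here's a minimal sketch of what a block-structured payload could look like in Python. The identifier names, the tag-style delimiters, and the `build_payload` helper are assumptions for illustration; any delimiter works as long as each block stays individually addressable.

```python
# Block names, in the fixed order from the list above.
BLOCK_ORDER = (
    "instructions",
    "hard_constraints",
    "user_state",
    "retrieved_evidence",
    "tool_outputs",
)


def build_payload(blocks: dict) -> str:
    """Render blocks in the fixed order, each wrapped in a labelled delimiter."""
    unknown = set(blocks) - set(BLOCK_ORDER)
    if unknown:
        # Reject stray blocks so nothing reaches the model unlabelled.
        raise ValueError(f"unknown blocks: {sorted(unknown)}")
    parts = []
    for name in BLOCK_ORDER:
        body = blocks.get(name, "").strip()
        if body:  # empty blocks are skipped entirely
            parts.append(f"<{name}>\n{body}\n</{name}>")
    return "\n\n".join(parts)


payload = build_payload({
    "tool_outputs": "status=ok",
    "instructions": "Answer from evidence only.",
})
```

The caller can pass blocks in any order; the rendered order is always the same, so two runs can be compared block by block.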

Not all data should carry equal weight

This one took me a while to really internalize. I use a simple four-tier hierarchy:

  • Tier A: system-of-record outputs and approved policies
  • Tier B: retrieved documents that might be stale
  • Tier C: user-provided data
  • Tier D: model assumptions

The rules that flow from this are strict. Tier A wins any conflict. Tier B has to be attributable. Tier C is input, not verified fact. Tier D is never treated as evidence for anything.

Without something like this, models blend everything into one confident-sounding answer where you can't tell what was real and what was assumed.
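The tier rules can be enforced in code before anything reaches the model. This is a rough sketch, not a definitive implementation: the `Claim` type, the `resolve` function, and the example keys are all hypothetical names I made up for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    A = 1  # system-of-record outputs and approved policies
    B = 2  # retrieved documents that might be stale
    C = 3  # user-provided data
    D = 4  # model assumptions


@dataclass(frozen=True)
class Claim:
    key: str
    value: str
    tier: Tier
    source: Optional[str] = None  # attribution, required for Tier B


def resolve(claims):
    """Keep one claim per key, preferring the most trusted tier."""
    evidence = {}
    for claim in claims:
        if claim.tier is Tier.D:
            continue  # model assumptions are never treated as evidence
        if claim.tier is Tier.B and not claim.source:
            raise ValueError(f"Tier B claim {claim.key!r} has no source")
        current = evidence.get(claim.key)
        if current is None or claim.tier < current.tier:
            evidence[claim.key] = claim  # lower number = higher trust
    return evidence


evidence = resolve([
    Claim("refund_window", "30 days", Tier.C),
    Claim("refund_window", "14 days", Tier.A, source="policy-db"),
    Claim("plan", "pro", Tier.D),
])
```

Each surviving claim keeps its tier, so a Tier C value can still be passed along as user input without being presented as verified fact.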

Wide retrieval usually isn't the answer

There's a temptation to retrieve as much as possible and let the model sort it out. In my experience, that just adds noise. I default to narrow, precise retrieval and only open it up when the task genuinely needs breadth.

The specifics: intent-aware queries, metadata and recency filters, controlled chunk sizes, no duplicate chunks, and a k (the number of chunks returned) tuned to the actual task. For decision-heavy flows, I'll always pick precision over recall. For synthesis tasks, I'll allow more breadth, but I still require source-grounded output.
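As a sketch of the filtering side of this, assuming candidates already come back from a retriever with a relevance score: the `Chunk` fields, the `select_chunks` helper, and the whitespace-and-case fingerprint used for deduplication are illustrative assumptions, and a real system would likely use a stronger near-duplicate check.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Chunk:
    text: str
    source_type: str
    updated_at: datetime
    score: float  # relevance score from the retriever (assumed)


def select_chunks(candidates, *, allowed_sources, max_age, k, now):
    """Apply metadata and recency filters, drop duplicates, keep the top k."""
    cutoff = now - max_age
    seen = set()
    kept = []
    for chunk in sorted(candidates, key=lambda c: c.score, reverse=True):
        if len(kept) >= k:
            break
        if chunk.source_type not in allowed_sources:
            continue  # metadata filter
        if chunk.updated_at < cutoff:
            continue  # recency filter
        fingerprint = " ".join(chunk.text.lower().split())
        if fingerprint in seen:
            continue  # duplicate after case and whitespace normalization
        seen.add(fingerprint)
        kept.append(chunk)
    return kept


now = datetime(2026, 2, 14, tzinfo=timezone.utc)
candidates = [
    Chunk("Refunds take 14 days.", "policy", now - timedelta(days=10), 0.90),
    Chunk("refunds  take 14 days.", "policy", now - timedelta(days=5), 0.80),
    Chunk("Old policy text", "policy", now - timedelta(days=400), 0.95),
    Chunk("Forum post", "forum", now - timedelta(days=1), 0.99),
    Chunk("Refunds need a receipt.", "policy", now - timedelta(days=2), 0.50),
]
selected = select_chunks(
    candidates,
    allowed_sources={"policy"},
    max_age=timedelta(days=90),
    k=2,
    now=now,
)
```

Note that the two highest-scoring candidates are both dropped here, one for its source type and one for its age. That is the point: score alone doesn't decide what gets in.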

"Be concise" is not a constraint

Soft instruction sentences don't give you stable behavior. A strict output schema does.

I put constraints right next to output contracts, and I make them explicit:

  • allowed categories
  • required fields
  • refusal conditions
  • evidence references that have to be present

When constraints are machine-readable, wording matters less. Violations get caught by a check instead of by a reader, so behavior becomes more predictable.
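To make that concrete, here's a small sketch of an output contract checked in plain Python. The category names, field names, and the rule that only a `refuse` answer may omit evidence are hypothetical choices for illustration, not a prescribed schema.

```python
import json

# Hypothetical contract values, chosen only for this example.
ALLOWED_CATEGORIES = {"approve", "reject", "escalate", "refuse"}
REQUIRED_FIELDS = {"category", "summary", "evidence_ids"}


def validate_output(raw: str, known_evidence_ids: set) -> list:
    """Return a list of contract violations; an empty list means valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    errors = []
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")

    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        errors.append(f"category not allowed: {category!r}")

    evidence_ids = data.get("evidence_ids", [])
    if not isinstance(evidence_ids, list):
        errors.append("evidence_ids must be a list")
        evidence_ids = []
    unknown = [e for e in evidence_ids
               if not isinstance(e, str) or e not in known_evidence_ids]
    if unknown:
        errors.append(f"unknown evidence ids: {unknown}")
    # Refusal condition: only an explicit refusal may come without evidence.
    if category != "refuse" and not evidence_ids:
        errors.append("non-refusal answers must cite evidence")
    return errors


ok = '{"category": "approve", "summary": "Within window.", "evidence_ids": ["doc-1"]}'
bad = '{"category": "maybe", "evidence_ids": ["doc-9"]}'
ok_errors = validate_output(ok, {"doc-1"})
bad_errors = validate_output(bad, {"doc-1"})
```

A non-empty error list can drive a retry or a hard failure, which is a much firmer lever than rephrasing "be concise".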

Freshness isn't one-size-fits-all

Different tasks have genuinely different tolerance for stale data. Product copy can usually run on older context just fine. Anything policy-related basically can't.

So I define freshness policies at the task level: TTL (time to live) by source type, refresh triggers, markers for stale data, and a safe fallback for when freshness is unknown. In my experience, a single global freshness rule is one of the most common reasons an assistant starts feeling randomly wrong over time.
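A per-task freshness policy might look something like the sketch below. The task names, source types, TTL values, and the `classify` helper are all made up for illustration, and refresh triggers are left out to keep it short.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class FreshnessPolicy:
    ttl_by_source: dict  # source type -> timedelta
    on_unknown: str      # "mark_stale" or "exclude"


# Hypothetical tasks and TTLs; real values depend on the product.
POLICIES = {
    "product_copy": FreshnessPolicy(
        {"docs": timedelta(days=180)}, on_unknown="mark_stale"
    ),
    "policy_answer": FreshnessPolicy(
        {"policy": timedelta(days=1)}, on_unknown="exclude"
    ),
}


def classify(task: str, source_type: str,
             updated_at: Optional[datetime], now: datetime) -> str:
    """Return 'fresh', 'stale', or 'exclude' for one piece of context."""
    policy = POLICIES[task]
    ttl = policy.ttl_by_source.get(source_type)
    if updated_at is None or ttl is None:
        # Freshness is unknown: use the task's safe fallback.
        return "stale" if policy.on_unknown == "mark_stale" else "exclude"
    return "fresh" if now - updated_at <= ttl else "stale"


now = datetime(2026, 2, 14, tzinfo=timezone.utc)
copy_status = classify("product_copy", "docs", now - timedelta(days=90), now)
policy_status = classify("policy_answer", "policy", now - timedelta(days=3), now)
```

The same 90-day-old document would pass for product copy and fail for a policy answer, which is exactly the distinction a global rule can't express.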

When behavior drifts, I run this loop

  1. Replay the failed examples.
  2. Diff the payload blocks against a known good run.
  3. Freeze the model version so I'm not chasing a moving target.
  4. Figure out which block changed.
  5. Patch the smallest possible surface.
  6. Add the case to the regression set.

Keeping this loop tight means fixes stay fast and I don't end up overcorrecting prompts when the real problem is somewhere else.
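The diff step is easy to automate once payloads are stored as named blocks. Here's a minimal sketch using Python's standard `difflib`; the `changed_blocks` helper and the dict-of-strings payload shape are assumptions for illustration.

```python
import difflib


def changed_blocks(good: dict, bad: dict) -> dict:
    """Map each block that differs between two runs to its unified diff."""
    diffs = {}
    for name in sorted(set(good) | set(bad)):
        before = good.get(name, "").splitlines()
        after = bad.get(name, "").splitlines()
        if before != after:
            diffs[name] = list(difflib.unified_diff(
                before, after,
                fromfile=f"good/{name}", tofile=f"bad/{name}",
                lineterm="",
            ))
    return diffs


good_run = {
    "instructions": "Answer from evidence only.",
    "retrieved_evidence": "Refund window: 14 days\nReceipt required",
}
bad_run = {
    "instructions": "Answer from evidence only.",
    "retrieved_evidence": "Refund window: 30 days\nReceipt required",
}
diffs = changed_blocks(good_run, bad_run)
```

If only `retrieved_evidence` shows up in the result, the prompt wording is off the hook and the fix belongs in retrieval.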

What this actually buys you

Context isn't extra text you stuff in before the real prompt. It's the core product interface. Get the structure right, and model behavior becomes a lot more predictable than most people expect.
