What is it
Root cause analysis is a process for determining the underlying cause of an issue. This process is a staple of traditional software testing but can also be effectively used for diagnosing problems with Large Language Model (LLM) behaviour, which is often non-deterministic and highly nuanced.
Why root cause analysis matters
Failures to comply with a given prompt or context are difficult to track down. It’s easy to make changes to a prompt without truly understanding the underlying issue. Root cause analysis prevents issues from recurring in different forms.
This is especially important when assessing LLMs, where the unwanted behaviour may not present itself immediately and is often only observable as a statistic.
The knee-jerk fix, adding another instruction to the prompt, usually masks the symptom without addressing the cause. RCA finds the actual structural problem.
The six-step process
1. Observability: turn the black box into a white box
For LLMs, this means capturing the exact inputs and outputs needed to reproduce the failure and understand the model’s ‘thinking.’ Collect the full prompt, context, model version, tokens, log probabilities, and final result.
Break the prompt into its structural components:
- Instruction: what the model is told to do
- Context: the data or knowledge provided
- Rules: behavioural constraints
- Examples: any few-shot examples included
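This breakdown can be captured programmatically for auditing. The sketch below uses an illustrative dataclass (`PromptBreakdown` and its fields are assumptions for this article, not a standard API):

```python
from dataclasses import dataclass, field

@dataclass
class PromptBreakdown:
    """Illustrative container for auditing a prompt's structural parts."""
    instruction: str
    context: str
    rules: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)

# Hypothetical decomposition of a selling-assistant prompt
breakdown = PromptBreakdown(
    instruction="You are a helpful assistant. Try to sell the user fruit.",
    context="Apple: crisp and juicy. Banana: long and curved.",
    rules=["Always stay on topic.", "Reply in no more than a sentence."],
)
print(len(breakdown.rules))  # prints 2
```

Keeping the components separate like this makes it trivial to ask, later in the process, which component a failure maps back to.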
2. Categorisation
Assign failure types clear and logical labels. Good labels are specific and actionable:
Good vs poor failure labels
| Poor label | Better label | Why it's better |
|---|---|---|
| Wrong answer | Factual error: medication interaction | Identifies the domain and error type |
| Bad response | Style violation: exceeded length constraint | Points to the specific rule broken |
| Hallucination | Confabulation: fabricated source citation | Distinguishes source fabrication from factual errors |
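Labels like these can then be applied consistently across a failure log. The keyword rules below are a deliberately toy sketch (real categorisation is usually manual or LLM-assisted; the label names are illustrative):

```python
# Illustrative label taxonomy: specific, actionable failure categories
FAILURE_LABELS = {
    "factual_error": "wrong fact in a specific domain (e.g. medication interaction)",
    "style_violation": "broke a formatting or length rule",
    "confabulation": "fabricated a source or citation",
}

def categorise(failure_note: str) -> str:
    """Toy keyword matcher over free-text failure notes."""
    note = failure_note.lower()
    if "source" in note or "citation" in note:
        return "confabulation"
    if "length" in note or "format" in note:
        return "style_violation"
    return "factual_error"

print(categorise("response exceeded the length constraint"))  # prints style_violation
```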
3. Pattern identification
To capture the different nuances in behaviour, look for and test patterns or clusters in the categorised failures. Common clustering dimensions:
- Position in context: do failures correlate with where information appears in the prompt?
- Interaction length: do failures increase in longer conversations?
- User behaviour: do certain user phrasings trigger more failures?
- Time/version: did a model update change the failure pattern?
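Once failures carry labels and metadata, clustering along one of these dimensions can be as simple as a grouped count. A minimal sketch over a hypothetical failure log (the records and field names are invented for illustration):

```python
from collections import Counter

# Hypothetical failure log: each record carries a label from step 2 plus metadata
failures = [
    {"label": "style violation: exceeded length", "conversation_turn": 12},
    {"label": "style violation: exceeded length", "conversation_turn": 15},
    {"label": "factual error: wrong price", "conversation_turn": 2},
    {"label": "style violation: exceeded length", "conversation_turn": 14},
]

# Cluster along one dimension: do failures concentrate in long conversations?
by_length = Counter(
    "long (>10 turns)" if f["conversation_turn"] > 10 else "short (<=10 turns)"
    for f in failures
)
print(dict(by_length))  # prints {'long (>10 turns)': 3, 'short (<=10 turns)': 1}
```

A cluster like "three of four length violations occur after turn 10" is exactly the kind of pattern the investigation step can map back to a prompt component.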
4. Investigation
Analyse the identified clusters and patterns. Map each cluster back to the prompt structure from step 1. Ask: which structural element could be causing this pattern?
For each failure cluster, identify: (a) which prompt component is most likely responsible, (b) whether the failure is deterministic or probabilistic, and (c) whether it interacts with other prompt components.
5. Hypothesise and test
Develop clear hypotheses based on the behavioural patterns and clusters identified during categorisation. Test these hypotheses objectively and systematically on minimal, reduced versions of the core functionality.
- Isolate variables: test on minimal, reduced versions of the prompt
- Scale iterations: run enough test iterations to achieve statistical significance; more runs give a more reliable signal
- Stay objective: do not assume the cause; let the data confirm or reject each hypothesis
- Iterate: continue the process until the genuine root cause is isolated
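Because individual LLM runs are noisy, "statistical significance" can be made concrete with a simple goodness-of-fit check. A minimal sketch against a uniform expectation (the helper and the 50-run counts below are hypothetical):

```python
def chi_square_uniform(counts: dict[str, int]) -> float:
    """Chi-square statistic of observed counts against a uniform expectation."""
    total = sum(counts.values())
    expected = total / len(counts)
    return sum((obs - expected) ** 2 / expected for obs in counts.values())

# Hypothetical fruit-mention counts from a 50-run test
observed = {"apple": 24, "banana": 0, "orange": 13, "strawberry": 49, "watermelon": 48}
stat = chi_square_uniform(observed)

# With 5 categories there are 4 degrees of freedom; the 5% critical value is
# about 9.49, so a statistic far above that rejects "equal representation".
print(round(stat, 2))  # prints 69.36
```

In practice a library routine such as `scipy.stats.chisquare` would also report the p-value directly; the hand-rolled version here just keeps the sketch dependency-free.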
6. Report on findings
Lessons learned from the root cause analysis should be documented and applied to future implementations. Document:
- The observed behaviour (with specific examples)
- The root cause(s) identified
- The evidence from hypothesis testing
- Recommended fixes (with tradeoffs noted)
- Lessons learned for future implementations
The value of RCA isn’t finding this bug; it’s building the diagnostic muscle that catches the next one faster.
Example scenario
The scenario below walks through a simple root cause analysis based on a straightforward application using an LLM to sell fruit.
```python
from openai import OpenAI

client = OpenAI(api_key="ENTER-YOUR-API-KEY")

def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4.1-mini",
    temperature: float = 1.0,
):
    # Returns the full ChatCompletion object so callers can inspect choices
    params = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
    }
    return client.chat.completions.create(**params)
```
```python
system_prompt_input = '''
You are a helpful assistant that always provides concise and relevant answers to user queries. you are to try to sell the user fruit.
Below is your context:
Apple
Description: A crisp and juicy fruit with a rounded shape, apples come in a wide variety of colors, including red, green, and yellow. Their flavor ranges from sweet to tart, and they have a dense, firm flesh. They are incredibly versatile, eaten raw, baked in pies, or cooked into sauces.
Banana
Description: This is a long, curved fruit with a soft, creamy white flesh protected by a bright yellow peel (which turns brown as it ripens). Bananas have a distinctly sweet, mild flavor and are a popular, portable snack. They are also a common ingredient in smoothies and baking.
Orange
Description: A round citrus fruit known for its tough, bright orange skin (or "rind") and its juicy, segmented flesh. Oranges have a refreshing sweet-tart flavor and are an excellent source of vitamin C. They are most famous for their juice but are also delicious eaten fresh.
Strawberry
Description: A small, heart-shaped fruit that is bright red when ripe. Its skin is dotted with tiny yellow seeds (called achenes). Strawberries have a sweet, slightly tart, and very fragrant flavor with a soft, juicy texture. They are a popular berry used in desserts, jams, and salads.
Watermelon
Description: A very large, heavy fruit with a thick, striped green rind and a vibrant red or pink interior. The flesh is incredibly watery (hence the name), crisp, and sweet, embedded with small black seeds. It's a classic summer fruit, perfect for hydration and refreshment.
Rules:
1. Always stay on topic.
2. Avoid unnecessary information.
3. Provide clear and direct responses only.
4. reply in no more than a sentence.
'''
```
```python
def call_the_llm():
    user_request = "hello, I'd like some fruit recommendations"
    api_response = get_completion(
        [
            {"role": "system", "content": system_prompt_input},
            {"role": "user", "content": user_request},
        ],
        model="gpt-4.1-mini",
    )
    content = api_response.choices[0].message.content
    print("Output: " + content)
    return content  # returned so later analysis code can collect responses

for _ in range(10):
    call_the_llm()
```
The application is run 10 times and gives this output:
```
Output: Try a crisp apple for a sweet or tart snack, or a juicy orange to refresh and boost your vitamin C!
Output: Try a crisp apple for a sweet or tart snack, or a juicy orange packed with vitamin C!
Output: Try a crisp apple for a versatile snack, a sweet banana for energy, or a juicy orange for refreshing vitamin C.
Output: Try a crisp apple for versatility, a sweet banana for a quick snack, or a juicy orange for a refreshing vitamin C boost!
Output: Try a crisp apple for a versatile snack, a sweet banana for portability, or a refreshing orange packed with vitamin C.
Output: Try a crisp apple for a versatile snack, a sweet banana for energy on the go, or a juicy orange to boost your vitamin C!
Output: Try crisp apples for versatility, sweet bananas for a quick snack, or refreshing oranges for a vitamin C boost!
Output: Try a crisp apple for a sweet or tart snack, or a juicy orange for a refreshing vitamin C boost.
Output: Try a crisp apple for a sweet or tart snack, a sweet banana for a quick energy boost, or a juicy orange for refreshing vitamin C.
Output: Try a crisp apple for a versatile snack, a sweet banana for on-the-go energy, or a juicy orange for a refreshing vitamin C boost!
```
As can be seen, the LLM has recommended:
Fruits recommended per turn
| Turn | Fruits recommended |
|---|---|
| 1, 2, 8 | Apple, Orange |
| 3, 4, 5, 6, 7, 9, 10 | Apple, Banana, Orange |
Total counts: Apple: 10, Orange: 10, Banana: 7, Strawberry: 0, Watermelon: 0.
Conclusion
The LLM mentioned Apples and Oranges in every run and Bananas in most, while Strawberries and Watermelons were not mentioned at all.
Without root cause analysis, the knee-jerk reaction might be to try adding a directive to the prompt:
“All fruits should be represented equally in sales this includes Apples, Oranges, Bananas, Strawberries and Watermelon”
If that is applied, the output now shows as this:
Total: apple: 10, banana: 3, orange: 10, strawberry: 3, watermelon: 4
The fruits are still represented disproportionately, though the results are better. At this point, it may be tempting to release this change. However, the root cause of the problem has not been uncovered.
Root cause analysis (RCA) applied
Observability
In this case, we can collect metadata by capturing the prompt and assessing its individual parts:
- Instruction: You are a helpful assistant that always provides concise and relevant answers to user queries. you are to try to sell the user fruit.
- Context: The detailed fruit descriptions (e.g., Apple, Banana, Orange…)
- Rules: 1. Always stay on topic, 2. Avoid unnecessary information, 3. Provide clear and direct responses only, 4. Reply in no more than a sentence.
Categorisation
We use the initial failure counts as our labels: apple: 10, orange: 10, banana: 7, strawberry: 0, watermelon: 0.
Pattern identification
The cluster is clear: the LLM output heavily favours Apples, Oranges, and Bananas over the other two fruits. This points to what might be seen as a fruit bias.
Investigation
Looking through the prompt, there are two main structural areas that could be causing this issue:
- The ordering of the context correlates with the outcome.
- The output-length rule could be preventing the LLM from mentioning fruits that would otherwise appear later in the response.
Hypothesis
- Context Order Bias: The LLM is reading the prompt and choosing the fruits closer to the top of the context more often. The context runs in the order of Apple > Banana > Orange > Strawberry > Watermelon.
- Restriction/Length Bias: The instruction to “reply in no more than a sentence” is causing the model to prioritise a limited number of fruits.
Testing the hypotheses
We must first ensure repeatability. The fruit counting mechanism is added to the script:
```python
import pprint

list_of_responses = []
for _ in range(10):
    response = call_the_llm()
    list_of_responses.append(response)

fruit_count = {
    "apple": 0,
    "banana": 0,
    "orange": 0,
    "strawberry": 0,
    "watermelon": 0,
}

# Search on word roots so plural forms (e.g. "strawberries") are counted too
fruit_search_roots = {
    "apple": "apple",
    "banana": "banana",
    "orange": "orange",
    "strawberry": "strawberr",
    "watermelon": "watermelon",
}

for response in list_of_responses:
    if response:
        response_lower = response.lower()
        for fruit_name in fruit_count:
            if fruit_search_roots[fruit_name] in response_lower:
                fruit_count[fruit_name] += 1

print("Fruit Counts")
pprint.pprint(fruit_count)
```
We reproduce the issue by testing the hypothesis on cut-down versions of the functionality.
Testing Hypothesis 1: Context Order Bias
We reverse the order of the context: Watermelon > Strawberry > Orange > Banana > Apple. The number of runs is increased to 50 to ensure statistical significance.
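One way to set this up, as a sketch, is to hold the fruit descriptions as separate strings and reverse them before assembling the prompt (the abbreviated blocks below are placeholders for the full descriptions):

```python
# Abbreviated placeholders for the full descriptions in the system prompt
fruit_context_blocks = [
    "Apple\nDescription: A crisp and juicy fruit...",
    "Banana\nDescription: A long, curved fruit...",
    "Orange\nDescription: A round citrus fruit...",
    "Strawberry\nDescription: A small, heart-shaped fruit...",
    "Watermelon\nDescription: A very large, heavy fruit...",
]

# Reverse the context order for the hypothesis test
reversed_context = "\n".join(reversed(fruit_context_blocks))
print(reversed_context.splitlines()[0])  # prints Watermelon
```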
Output: {'apple': 24, 'banana': 0, 'orange': 13, 'strawberry': 49, 'watermelon': 48}
It can clearly be seen that the ordering has a significant effect. However, there are still 24 apples even though they are now positioned last, potentially indicating a residual bias towards apples.
Testing Hypothesis 2: Restriction/Length Bias
We remove the output limit instruction: “reply in no more than a sentence.”
Output: {'apple': 50, 'banana': 50, 'orange': 49, 'strawberry': 46, 'watermelon': 49}
Here, it can be seen that the LLM is now representing all fruits equally (within statistical margin). However, this is achieved at the cost of extra tokens and verbosity. Any solution using this approach would need to assess if the extra verbosity was desirable and fits within cost expectations.
Testing Hypothesis 3: Residual Apple Bias
Based on the results of Hypothesis 1, we test whether the LLM has an inherent bias towards apples, separate from the order bias, in a minimal setting. The prompt asks the LLM to pick a fruit at random from the list: Apple, Banana, Orange, Strawberry, Watermelon.
```python
system_prompt_input = '''
You are to pick a fruit at random from the list, Apple, Banana, Orange, Strawberry, Watermelon.
1. Reply with only the name of the fruit, and nothing else.
2. The choice must be totally random.
3. Think carefully about each choice and ignore any order or pattern in the list.
'''
```
Output: {'apple': 0, 'banana': 5, 'orange': 12, 'strawberry': 20, 'watermelon': 13}
The result shows zero apples when the LLM is explicitly asked to be random. This result seems contradictory to Hypothesis 1’s residual apple count and warrants further investigation (likely related to Hypothesis 4).
Testing Hypothesis 4: Order Dominance
We test whether the order of the list affects the output more than any fruit-specific bias. The prompt asks the LLM to pick a fruit at random from the reversed list: Watermelon, Strawberry, Orange, Banana, Apple.
```python
system_prompt_input = '''
You are to pick a fruit at random from the list, Watermelon, Strawberry, Orange, Banana, Apple.
1. Reply with only the name of the fruit, and nothing else.
2. The choice must be totally random.
3. Think carefully about each choice and ignore any order or pattern in the list.
'''
```
Output: {'apple': 31, 'banana': 3, 'orange': 15, 'strawberry': 1, 'watermelon': 0}
This test strongly confirms that positional bias (order) is the dominant factor: Apple, now at the end of the input list, is selected the most, mirroring the pattern from Hypothesis 3. The position in the list, not the fruit itself, drives the selection.
Root cause report
Observed behaviour
The LLM was mentioning Apples, Oranges, and Bananas much more than Watermelons and Strawberries.
Root cause analysis outcome
The LLM’s output is governed by two interacting root causes:
- Context Order Bias: The model exhibits a strong tendency to favour fruits listed earlier in the context.
- Output Restriction: The rule “reply in no more than a sentence” severely limits the output length, reinforcing the context order bias by forcing the model to only mention the first few items it considered.
The idea of testing for seasonal bias is an interesting follow-up experiment, but it is outside the scope of the current RCA on the prompt length/order issue.
Outcome
This final step demonstrates the true value of RCA. The ‘knee-jerk’ fix was a fragile patch that only hid the symptom. The methodical RCA process uncovered the real problem, allowing for a permanent, robust fix (such as randomising the context order for each call, or carefully removing or rewording the conflicting output restriction rule). This is what prevents the same bug from recurring in a different form.
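As a sketch of the first fix, the context blocks can be shuffled on every call so that no fruit consistently occupies a favoured position (the abbreviated blocks and prompt text below are stand-ins for the full prompt in the scenario):

```python
import random

# Abbreviated stand-ins for the full fruit descriptions
fruit_context_blocks = [
    "Apple\nDescription: A crisp and juicy fruit...",
    "Banana\nDescription: A long, curved fruit...",
    "Orange\nDescription: A round citrus fruit...",
    "Strawberry\nDescription: A small, heart-shaped fruit...",
    "Watermelon\nDescription: A very large, heavy fruit...",
]

def build_system_prompt() -> str:
    """Shuffle the context order so positional bias averages out across calls."""
    blocks = fruit_context_blocks[:]
    random.shuffle(blocks)
    return (
        "You are a helpful assistant. Try to sell the user fruit.\n"
        "Below is your context:\n" + "\n".join(blocks)
    )

prompt = build_system_prompt()
print(all(f in prompt for f in
          ["Apple", "Banana", "Orange", "Strawberry", "Watermelon"]))  # prints True
```

Shuffling per request does not remove the model's positional bias; it distributes the bias evenly across fruits, which is often sufficient for this kind of representation requirement.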
Common root causes in LLM applications
From our experience, the most frequent root causes fall into a few categories:
Common root cause categories
| Root cause | How it manifests | Typical fix |
|---|---|---|
| Context order bias | Model favours information listed earlier in the prompt | Randomise context order per request |
| Conflicting instructions | Two rules contradict, model satisfies one at the expense of the other | Audit rules for conflicts, prioritise explicitly |
| Output constraint interference | Length/format rules prevent the model from completing its task | Relax constraints or restructure the task |
| Insufficient context | Model confabulates to fill gaps in provided information | Add explicit 'say I don't know' instructions |
| Temperature mismatch | Too much randomness for factual tasks, or too little for creative ones | Align temperature with task requirements |
Template
Use this structure for your own RCA reports:
- Observed behaviour: [What happened, with specific examples]
- Reproduction steps: [Exact prompt, model, settings]
- Failure categorisation: [Labels applied]
- Pattern identified: [Cluster description]
- Hypotheses tested: [Each hypothesis + result]
- Root cause: [Final determination]
- Recommended fix: [With tradeoffs]
- Lessons learned: [For future reference]
This methodology has been refined through application across multiple client engagements. The template is provided freely for use in your own evaluation processes.