Root Cause Analysis for LLM Behaviour

A step-by-step guide to diagnosing why an LLM behaves unexpectedly. Covers observability, categorisation, hypothesis testing, and reporting, with a template you can use immediately.

Dec 2025 8 min

The instinct to patch the prompt

The model does something it should not. A user reports it. You open the system, look at the prompt, and you can already see the line you are about to add. “Always include all five products.” “Never mention competitors.” “Do not refuse this kind of question.” The change goes in, the behaviour improves, the ticket closes.

Then a week later it comes back, in a slightly different form, and the next change goes in on top of the last one. The prompt grows. The general behaviour gets harder to predict. The original cause is still in there somewhere, untouched.

This is the loop that root cause analysis is built to break.

What it is

Root cause analysis is a process for determining the underlying cause of an issue. It is a staple of traditional software testing, and it works just as well on Large Language Model behaviour, where the failures are non-deterministic, often statistical rather than absolute, and rarely reproducible from the report alone.

Why it matters

When a model fails to follow its prompt or context, the cause is difficult to track down. It is easy to change the prompt without understanding the underlying issue. RCA keeps the same issue from recurring in different forms.

This matters more for LLMs than it does for traditional systems. Unwanted behaviour often cannot be reproduced from the single run that triggered the report. It shows up as a rate across many runs. The instinct is to roll out a quick fix, when it is usually more effective to spend that time investigating the issue.

The knee-jerk fix, adding another instruction to the prompt, usually masks the symptom without addressing the cause. RCA finds the actual structural problem.

The six-step process

1. Observability, turn the black box into a white box

For LLMs, this means capturing the exact inputs and outputs needed to reproduce the failure and understand what the model was working with. Collect the full prompt, the assembled context, the model version, the token usage, the log probabilities where available, and the final result.

Break the prompt into its structural components:

Once the prompt is decomposed, each part can be examined as a candidate cause rather than the prompt being treated as one opaque block.
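A minimal sketch of what such a capture record could look like, assuming a prompt that is assembled from named parts. The class names, field names, and the component names used in the example ("system_instructions", "context", "user_message") are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PromptComponent:
    name: str  # hypothetical label, e.g. "system_instructions"
    text: str


@dataclass
class TraceRecord:
    """Everything needed to reproduce one run (field names are assumptions)."""
    components: List[PromptComponent]
    model_version: str
    output: str
    prompt_tokens: Optional[int] = None
    completion_tokens: Optional[int] = None
    logprobs: Optional[List[float]] = None  # only where the API exposes them

    def full_prompt(self) -> str:
        # The prompt exactly as assembled, components joined in order.
        return "\n\n".join(c.text for c in self.components)

    def without(self, name: str) -> str:
        # The same prompt with one component removed, for later ablation tests.
        return "\n\n".join(c.text for c in self.components if c.name != name)


record = TraceRecord(
    components=[
        PromptComponent("system_instructions", "You sell fruit."),
        PromptComponent("context", "Products: ..."),
        PromptComponent("user_message", "What do you sell?"),
    ],
    model_version="example-model-v1",  # placeholder, not a real model name
    output="...",
)
print(record.without("context"))
```

Storing the components separately, rather than only the final string, is what makes it possible to remove or reorder one part at a time during hypothesis testing.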

2. Categorisation

Assign failure types clear and logical labels. Good labels are specific and actionable. Vague labels are how patterns get missed.

Good vs poor failure labels

| Poor label | Better label | Why it's better |
|---|---|---|
| Wrong answer | Factual error, medication interaction | Identifies the domain and error type |
| Bad response | Style violation, exceeded length constraint | Points to the specific rule broken |
| Hallucination | Confabulation, fabricated source citation | Distinguishes source fabrication from factual errors |

A label that names the structural element involved is a label that points the next step at the right part of the prompt.
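One way to keep labels specific is to make the label a two-part value and reject the vague ones at the point of entry. The sketch below assumes a "category, detail" shape taken from the table; the `FailureLabel` class and the `is_actionable` check are hypothetical helpers, and the list of vague terms is just the "poor label" column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureLabel:
    category: str  # error type, e.g. "Factual error"
    detail: str    # domain or rule involved, e.g. "medication interaction"

    def __str__(self) -> str:
        return f"{self.category}, {self.detail}"


# The "poor label" column from the table; extend for your own project.
VAGUE_CATEGORIES = {"wrong answer", "bad response", "hallucination"}


def is_actionable(label: FailureLabel) -> bool:
    """A label is usable only if it names an error type and a specific detail."""
    has_detail = bool(label.detail.strip())
    is_vague = label.category.strip().lower() in VAGUE_CATEGORIES
    return has_detail and not is_vague


good = FailureLabel("Style violation", "exceeded length constraint")
poor = FailureLabel("Bad response", "")
print(good, is_actionable(good))
print(poor, is_actionable(poor))
```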

3. Pattern identification

To capture the nuances in behaviour, look for patterns or clusters in the categorised failures. The same individual failure means different things depending on the cluster it sits in.
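Clustering can start as simple counting. The sketch below groups categorised failures by whichever dimensions you choose and returns the largest groups first. The dimension names in the example ("label", "model_version") are assumptions for illustration; use whatever fields your own failure records carry.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def cluster_counts(
    failures: List[Dict[str, str]], dimensions: Sequence[str]
) -> List[Tuple[tuple, int]]:
    """Count failures per combination of the chosen dimensions, largest first."""
    counts: Counter = Counter()
    for failure in failures:
        # Missing fields become None so incomplete records still get counted.
        key = tuple(failure.get(d) for d in dimensions)
        counts[key] += 1
    return counts.most_common()


failures = [
    {"label": "Style violation, exceeded length constraint", "model_version": "v1"},
    {"label": "Style violation, exceeded length constraint", "model_version": "v1"},
    {"label": "Factual error, medication interaction", "model_version": "v1"},
    {"label": "Style violation, exceeded length constraint", "model_version": "v2"},
]

for key, count in cluster_counts(failures, ["label", "model_version"]):
    print(count, key)
```

Running the same data through different dimension sets is a cheap way to see whether a failure tracks the label alone or only one combination of conditions.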

Common clustering dimensions:

4. Investigation

Analyse the identified clusters. Map each cluster back to the prompt structure from step 1. Ask which structural element could be causing the pattern.

For each failure cluster, identify: (a) which prompt component is most likely responsible, (b) whether the failure is deterministic or probabilistic, and (c) whether it interacts with other prompt components.

Interaction is the part that is most often missed. Two prompt components can each look fine in isolation while producing the failure together. The case study referenced below is exactly this pattern.
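Questions (b) and (c) can be answered roughly from repeated runs. The sketch below assumes you already have pass/fail outcomes for the same input run many times, and failure rates measured with each suspect component present alone and both together. The function names and the 0.2 margin are arbitrary assumptions, not established thresholds.

```python
from typing import Sequence


def failure_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of runs that failed; each outcome is True when the run failed."""
    if not outcomes:
        raise ValueError("need at least one run")
    return sum(outcomes) / len(outcomes)


def classify(outcomes: Sequence[bool]) -> str:
    """Label a failure by how often it reproduces on identical input."""
    rate = failure_rate(outcomes)
    if rate == 1.0:
        return "deterministic"
    if rate == 0.0:
        return "not reproduced"
    return "probabilistic"


def looks_like_interaction(
    rate_both: float, rate_a_only: float, rate_b_only: float, margin: float = 0.2
) -> bool:
    """True when both components together fail clearly more often than either alone.

    The margin is an assumed cut-off for a first pass, not a statistical test.
    """
    return rate_both - max(rate_a_only, rate_b_only) >= margin


runs = [True, False, True, True, False, False, True, False, False, False]
print(classify(runs), failure_rate(runs))
print(looks_like_interaction(rate_both=0.8, rate_a_only=0.1, rate_b_only=0.05))
```

A positive result here is a reason to write a hypothesis about the pair of components, not a conclusion in itself.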

5. Hypothesise and test

Develop clear hypotheses based on the patterns and clusters from the previous steps. Test them objectively and systematically on minimal, reduced versions of the core functionality.
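Because the failures are statistical, a hypothesis test usually means comparing failure rates between the original prompt and a reduced variant over many runs. One common choice, swapped in here as an assumption rather than taken from the methodology above, is a two-proportion z-test. The sketch uses only the standard library; the function name and the example counts are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z(
    fail_a: int, n_a: int, fail_b: int, n_b: int
) -> Tuple[float, float]:
    """Compare two failure rates; returns (z statistic, two-sided p-value).

    Uses the pooled-proportion normal approximation, which is only
    reasonable when both samples have a decent number of runs.
    """
    p_a = fail_a / n_a
    p_b = fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Every run passed, or every run failed, in both variants.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


# Hypothetical counts: baseline prompt vs. a variant with one component removed.
z, p = two_proportion_z(fail_a=40, n_a=100, fail_b=10, n_b=100)
print(f"z={z:.2f}, p={p:.6f}")
```

Change one component per variant. If two components change at once, a difference in failure rate cannot be attributed to either.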

6. Report on findings

Lessons from the analysis should be documented and applied to future implementations. Document:

The value of RCA is not just finding this bug. It is building the diagnostic muscle that catches the next one faster.

A worked example

The methodology becomes much clearer when applied to a real failure. The companion piece Context Order Bias: Why This LLM Only Sold Apples and Oranges walks through the full process on a fruit-seller chatbot that consistently ignored two of its five products. It applies every step of this methodology, isolates two interacting biases through hypothesis testing, and arrives at a structural fix rather than a prompt patch.

If you have just read the methodology and want to see it run end to end, that case study is the demonstration.

Common root causes in LLM applications

A handful of structural causes turn up repeatedly across LLM applications. Recognising them speeds up the analysis on subsequent investigations.

Common root cause categories

| Root cause | How it manifests | Typical fix |
|---|---|---|
| Context order bias | Model favours information listed earlier in the prompt | Randomise context order per request |
| Conflicting instructions | Two rules contradict, model satisfies one at the expense of the other | Audit rules for conflicts, prioritise explicitly |
| Output constraint interference | Length or format rules prevent the model from completing its task | Relax constraints or restructure the task |
| Insufficient context | Model confabulates to fill gaps in provided information | Add explicit 'say I don't know' instructions |
| Temperature mismatch | Too much randomness for factual tasks, or too little for creative ones | Align temperature with task requirements |

These are starting points for the investigation step, not conclusions. A pattern that looks like context order bias on first inspection might turn out to be conflicting instructions once the hypothesis testing runs. The categories help frame the search. They do not replace it.
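As an illustration of what a structural fix looks like compared with a prompt patch, here is a minimal sketch of the first row of the table: shuffling context items each time the prompt is assembled. The function name, the bullet formatting, and the placeholder product names are assumptions.

```python
import random
from typing import Optional, Sequence


def assemble_context(items: Sequence[str], rng: Optional[random.Random] = None) -> str:
    """Return the context block with items in a fresh random order.

    Pass a seeded random.Random to make a run reproducible during testing.
    """
    rng = rng or random.Random()
    shuffled = list(items)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return "\n".join(f"- {item}" for item in shuffled)


products = ["product A", "product B", "product C", "product D", "product E"]
print(assemble_context(products))
```

Logging the order actually used alongside the output keeps each request reproducible even though the order now varies.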

Template

Use this structure for your own RCA reports:

  1. Observed behaviour. What happened, with specific examples.
  2. Reproduction steps. Exact prompt, model, settings.
  3. Failure categorisation. Labels applied.
  4. Pattern identified. Cluster description.
  5. Hypotheses tested. Each hypothesis and the result.
  6. Root cause. Final determination.
  7. Recommended fix. With tradeoffs.
  8. Lessons learned. For future reference.
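If reports are stored alongside code, the template can be kept as a small data structure so every report has the same eight sections. This is an optional sketch: the `RCAReport` class and its `render` method are hypothetical, and the field names simply mirror the headings above.

```python
from dataclasses import dataclass, fields


@dataclass
class RCAReport:
    observed_behaviour: str
    reproduction_steps: str
    failure_categorisation: str
    pattern_identified: str
    hypotheses_tested: str
    root_cause: str
    recommended_fix: str
    lessons_learned: str

    def render(self) -> str:
        """Render the report as a numbered list in the template's order."""
        lines = []
        for number, f in enumerate(fields(self), start=1):
            title = f.name.replace("_", " ").capitalize()
            lines.append(f"{number}. {title}. {getattr(self, f.name)}")
        return "\n".join(lines)


report = RCAReport(
    observed_behaviour="...",
    reproduction_steps="...",
    failure_categorisation="...",
    pattern_identified="...",
    hypotheses_tested="...",
    root_cause="...",
    recommended_fix="...",
    lessons_learned="...",
)
print(report.render())
```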

Final thoughts

Back to the engineer about to add another line to the prompt. With RCA in place, that line does not get written. The investigation runs first. The structural cause gets identified, the fix happens at the right level, and the prompt does not grow another paragraph that future engineers will have to work around.

That is what RCA buys. Not just this fix, but a smaller, cleaner system to maintain, and the diagnostic muscle to handle the next failure faster than the last one.


This methodology has been refined through application across multiple client engagements. The template is provided freely for use in your own evaluation processes.