What is it
Root cause analysis is a process for determining the underlying cause of an issue. This process is a staple of traditional software testing but can also be effectively used for diagnosing problems with Large Language Model (LLM) behaviour, which is often non-deterministic and highly nuanced.
Why root cause analysis matters
Failures to comply with a given prompt or context are difficult to track down. It’s easy to make changes to a prompt without truly understanding the underlying issue. Root cause analysis prevents issues from recurring in different forms.
This is especially important when assessing LLMs, where the unwanted behaviour may not present itself immediately and is often only observable as a statistic.
The knee-jerk fix, adding another instruction to the prompt, usually masks the symptom without addressing the cause. RCA finds the actual structural problem.
The six-step process
1. Observability: turn the black box into a white box
For LLMs, this means capturing the exact inputs and outputs needed to reproduce the failure and understand the model’s ‘thinking.’ Collect the full prompt, context, model version, tokens, log probabilities, and final result.
Break the prompt into its structural components:
- Instruction: what the model is told to do
- Context: the data or knowledge provided
- Rules: behavioural constraints
- Examples: any few-shot examples included
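This breakdown can be captured programmatically for auditing. The sketch below uses an illustrative dataclass (`PromptBreakdown` and its fields are assumptions for this article, not a standard API):

```python
from dataclasses import dataclass, field

@dataclass
class PromptBreakdown:
    """Illustrative container for auditing a prompt's structural parts."""
    instruction: str
    context: str
    rules: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)

# Hypothetical decomposition of a selling-assistant prompt
breakdown = PromptBreakdown(
    instruction="You are a helpful assistant. Try to sell the user fruit.",
    context="Apple: crisp and juicy. Banana: long and curved.",
    rules=["Always stay on topic.", "Reply in no more than a sentence."],
)
print(len(breakdown.rules))  # prints 2
```

Keeping the components separate like this makes it trivial to ask, later in the process, which component a failure maps back to.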
2. Categorisation
Assign failure types clear and logical labels. Good labels are specific and actionable:
Good vs poor failure labels
| Poor label | Better label | Why it's better |
|---|---|---|
| Wrong answer | Factual error: medication interaction | Identifies the domain and error type |
| Bad response | Style violation: exceeded length constraint | Points to the specific rule broken |
| Hallucination | Confabulation: fabricated source citation | Distinguishes source fabrication from factual errors |
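Labels like these can then be applied consistently across a failure log. The keyword rules below are a deliberately toy sketch (real categorisation is usually manual or LLM-assisted; the label names are illustrative):

```python
# Illustrative label taxonomy: specific, actionable failure categories
FAILURE_LABELS = {
    "factual_error": "wrong fact in a specific domain (e.g. medication interaction)",
    "style_violation": "broke a formatting or length rule",
    "confabulation": "fabricated a source or citation",
}

def categorise(failure_note: str) -> str:
    """Toy keyword matcher over free-text failure notes."""
    note = failure_note.lower()
    if "source" in note or "citation" in note:
        return "confabulation"
    if "length" in note or "format" in note:
        return "style_violation"
    return "factual_error"

print(categorise("response exceeded the length constraint"))  # prints style_violation
```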
3. Pattern identification
To capture the different nuances in behaviour, look for and test patterns or clusters in the categorised failures. Common clustering dimensions:
- Position in context: do failures correlate with where information appears in the prompt?
- Interaction length: do failures increase in longer conversations?
- User behaviour: do certain user phrasings trigger more failures?
- Time/version: did a model update change the failure pattern?
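Once failures carry labels and metadata, clustering along one of these dimensions can be as simple as a grouped count. A minimal sketch over a hypothetical failure log (the records and field names are invented for illustration):

```python
from collections import Counter

# Hypothetical failure log: each record carries a label from step 2 plus metadata
failures = [
    {"label": "style violation: exceeded length", "conversation_turn": 12},
    {"label": "style violation: exceeded length", "conversation_turn": 15},
    {"label": "factual error: wrong price", "conversation_turn": 2},
    {"label": "style violation: exceeded length", "conversation_turn": 14},
]

# Cluster along one dimension: do failures concentrate in long conversations?
by_length = Counter(
    "long (>10 turns)" if f["conversation_turn"] > 10 else "short (<=10 turns)"
    for f in failures
)
print(dict(by_length))  # prints {'long (>10 turns)': 3, 'short (<=10 turns)': 1}
```

A cluster like "three of four length violations occur after turn 10" is exactly the kind of pattern the investigation step can map back to a prompt component.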
4. Investigation
Analyse the identified clusters and patterns. Map each cluster back to the prompt structure from step 1. Ask: which structural element could be causing this pattern?
For each failure cluster, identify: (a) which prompt component is most likely responsible, (b) whether the failure is deterministic or probabilistic, and (c) whether it interacts with other prompt components.
5. Hypothesise and test
Develop clear hypotheses based on the behavioural patterns and clusters identified during categorisation. Test these hypotheses objectively and systematically on minimal, reduced versions of the core functionality.
- Isolate variables: test on minimal, reduced versions of the prompt
- Scale iterations: run enough test iterations to achieve statistical significance; more runs give a more reliable signal
- Stay objective: do not assume the cause; let the data confirm or reject each hypothesis
- Iterate: continue the process until the genuine root cause is isolated
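Because individual LLM runs are noisy, "statistical significance" can be made concrete with a simple goodness-of-fit check. A minimal sketch against a uniform expectation (the helper and the 50-run counts below are hypothetical):

```python
def chi_square_uniform(counts: dict[str, int]) -> float:
    """Chi-square statistic of observed counts against a uniform expectation."""
    total = sum(counts.values())
    expected = total / len(counts)
    return sum((obs - expected) ** 2 / expected for obs in counts.values())

# Hypothetical fruit-mention counts from a 50-run test
observed = {"apple": 24, "banana": 0, "orange": 13, "strawberry": 49, "watermelon": 48}
stat = chi_square_uniform(observed)

# With 5 categories there are 4 degrees of freedom; the 5% critical value is
# about 9.49, so a statistic far above that rejects "equal representation".
print(round(stat, 2))  # prints 69.36
```

In practice a library routine such as `scipy.stats.chisquare` would also report the p-value directly; the hand-rolled version here just keeps the sketch dependency-free.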
6. Report on findings
Lessons learned from the root cause analysis should be documented and applied to future implementations. Document:
- The observed behaviour (with specific examples)
- The root cause(s) identified
- The evidence from hypothesis testing
- Recommended fixes (with tradeoffs noted)
- Lessons learned for future implementations
The value of RCA isn’t finding this bug; it’s building the diagnostic muscle that catches the next one faster.
Example scenario
The scenario below walks through a simple root cause analysis based on a straightforward application using an LLM to sell fruit.
```python
from openai import OpenAI

client = OpenAI(api_key="ENTER-YOUR-API-KEY")

def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4.1-mini",
    temperature: float = 1.0,
):
    # Returns the full ChatCompletion object so callers can inspect choices
    params = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
    }
    return client.chat.completions.create(**params)
```
```python
system_prompt_input = '''
You are a helpful assistant that always provides concise and relevant answers to user queries. you are to try to sell the user fruit.
Below is your context:
Apple
Description: A crisp and juicy fruit with a rounded shape, apples come in a wide variety of colors, including red, green, and yellow. Their flavor ranges from sweet to tart, and they have a dense, firm flesh. They are incredibly versatile, eaten raw, baked in pies, or cooked into sauces.
Banana
Description: This is a long, curved fruit with a soft, creamy white flesh protected by a bright yellow peel (which turns brown as it ripens). Bananas have a distinctly sweet, mild flavor and are a popular, portable snack. They are also a common ingredient in smoothies and baking.
Orange
Description: A round citrus fruit known for its tough, bright orange skin (or "rind") and its juicy, segmented flesh. Oranges have a refreshing sweet-tart flavor and are an excellent source of vitamin C. They are most famous for their juice but are also delicious eaten fresh.
Strawberry
Description: A small, heart-shaped fruit that is bright red when ripe. Its skin is dotted with tiny yellow seeds (called achenes). Strawberries have a sweet, slightly tart, and very fragrant flavor with a soft, juicy texture. They are a popular berry used in desserts, jams, and salads.
Watermelon
Description: A very large, heavy fruit with a thick, striped green rind and a vibrant red or pink interior. The flesh is incredibly watery (hence the name), crisp, and sweet, embedded with small black seeds. It's a classic summer fruit, perfect for hydration and refreshment.
Rules:
1. Always stay on topic.
2. Avoid unnecessary information.
3. Provide clear and direct responses only.
4. reply in no more than a sentence.
'''
```
```python
def call_the_llm():
    user_request = "hello, I'd like some fruit recommendations"
    api_response = get_completion(
        [
            {"role": "system", "content": system_prompt_input},
            {"role": "user", "content": user_request},
        ],
        model="gpt-4.1-mini",
    )
    content = api_response.choices[0].message.content
    print("Output: " + content)
    return content  # returned so later analysis code can collect responses

for _ in range(10):
    call_the_llm()
```
The application is run 10 times and gives this output:
```
Output: Try a crisp apple for a sweet or tart snack, or a juicy orange to refresh and boost your vitamin C!
Output: Try a crisp apple for a sweet or tart snack, or a juicy orange packed with vitamin C!
Output: Try a crisp apple for a versatile snack, a sweet banana for energy, or a juicy orange for refreshing vitamin C.
Output: Try a crisp apple for versatility, a sweet banana for a quick snack, or a juicy orange for a refreshing vitamin C boost!
Output: Try a crisp apple for a versatile snack, a sweet banana for portability, or a refreshing orange packed with vitamin C.
Output: Try a crisp apple for a versatile snack, a sweet banana for energy on the go, or a juicy orange to boost your vitamin C!
Output: Try crisp apples for versatility, sweet bananas for a quick snack, or refreshing oranges for a vitamin C boost!
Output: Try a crisp apple for a sweet or tart snack, or a juicy orange for a refreshing vitamin C boost.
Output: Try a crisp apple for a sweet or tart snack, a sweet banana for a quick energy boost, or a juicy orange for refreshing vitamin C.
Output: Try a crisp apple for a versatile snack, a sweet banana for on-the-go energy, or a juicy orange for a refreshing vitamin C boost!
```
As can be seen, the LLM has recommended:
Fruits recommended per turn
| Turn | Fruits recommended |
|---|---|
| 1, 2, 8 | Apple, Orange |
| 3, 4, 5, 6, 7, 9, 10 | Apple, Banana, Orange |
Total counts: Apple: 10, Orange: 10, Banana: 7, Strawberry: 0, Watermelon: 0.
Conclusion
The LLM mentioned Apples and Oranges in every run and Bananas in most, while Strawberries and Watermelons were not mentioned at all.
Without root cause analysis, the knee-jerk reaction might be to try adding a directive to the prompt:
“All fruits should be represented equally in sales this includes Apples, Oranges, Bananas, Strawberries and Watermelon”
If that is applied, the output now shows as this:
Total: apple: 10, banana: 3, orange: 10, strawberry: 3, watermelon: 4
The fruits are still represented disproportionately, though the results are better. At this point, it may be tempting to release this change. However, the root cause of the problem has not been uncovered.
Root cause analysis (RCA) applied
Observability
In this case, we can collect metadata by capturing the prompt and assessing its individual parts:
- Instruction: You are a helpful assistant that always provides concise and relevant answers to user queries. you are to try to sell the user fruit.
- Context: The detailed fruit descriptions (e.g., Apple, Banana, Orange…)
- Rules: 1. Always stay on topic, 2. Avoid unnecessary information, 3. Provide clear and direct responses only, 4. Reply in no more than a sentence.
Categorisation
We use the initial failure counts as our labels: apple: 10, orange: 10, banana: 7, strawberry: 0, watermelon: 0.
Pattern identification
The cluster is clear: the LLM output heavily favours Apples, Oranges, and Bananas over the other two fruits. This points to what might be seen as a fruit bias.
Investigation
Looking through the prompt, there are two main structural areas that could be causing this issue:
- The ordering of the context correlates with the outcome.
- The output-length rule could be preventing the LLM from mentioning fruits that would otherwise appear later in the response.
Hypothesis
- Context Order Bias: The LLM is reading the prompt and choosing the fruits closer to the top of the context more often. The context runs in the order of Apple > Banana > Orange > Strawberry > Watermelon.
- Restriction/Length Bias: The instruction to “reply in no more than a sentence” is causing the model to prioritise a limited number of fruits.
Testing the hypotheses
We must first ensure repeatability. The fruit counting mechanism is added to the script:
```python
import pprint

list_of_responses = []
for _ in range(10):
    response = call_the_llm()
    list_of_responses.append(response)

fruit_count = {
    "apple": 0,
    "banana": 0,
    "orange": 0,
    "strawberry": 0,
    "watermelon": 0,
}

# Search on word roots so plural forms (e.g. "strawberries") are counted too
fruit_search_roots = {
    "apple": "apple",
    "banana": "banana",
    "orange": "orange",
    "strawberry": "strawberr",
    "watermelon": "watermelon",
}

for response in list_of_responses:
    if response:
        response_lower = response.lower()
        for fruit_name in fruit_count:
            if fruit_search_roots[fruit_name] in response_lower:
                fruit_count[fruit_name] += 1

print("Fruit Counts")
pprint.pprint(fruit_count)
```
We reproduce the issue by testing the hypothesis on cut-down versions of the functionality.
Testing Hypothesis 1: Context Order Bias
We reverse the order of the context: Watermelon > Strawberry > Orange > Banana > Apple. The number of runs is increased to 50 to ensure statistical significance.
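One way to set this up, as a sketch, is to hold the fruit descriptions as separate strings and reverse them before assembling the prompt (the abbreviated blocks below are placeholders for the full descriptions):

```python
# Abbreviated placeholders for the full descriptions in the system prompt
fruit_context_blocks = [
    "Apple\nDescription: A crisp and juicy fruit...",
    "Banana\nDescription: A long, curved fruit...",
    "Orange\nDescription: A round citrus fruit...",
    "Strawberry\nDescription: A small, heart-shaped fruit...",
    "Watermelon\nDescription: A very large, heavy fruit...",
]

# Reverse the context order for the hypothesis test
reversed_context = "\n".join(reversed(fruit_context_blocks))
print(reversed_context.splitlines()[0])  # prints Watermelon
```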
Output: {'apple': 24, 'banana': 0, 'orange': 13, 'strawberry': 49, 'watermelon': 48}
It can clearly be seen that the ordering has a significant effect. However, there are still 24 apples even though they are now positioned last, potentially indicating a residual bias towards apples.
Testing Hypothesis 2: Restriction/Length Bias
We remove the output limit instruction: “reply in no more than a sentence.”
Output: {'apple': 50, 'banana': 50, 'orange': 49, 'strawberry': 46, 'watermelon': 49}
Here, it can be seen that the LLM is now representing all fruits equally (within statistical margin). However, this is achieved at the cost of extra tokens and verbosity. Any solution using this approach would need to assess if the extra verbosity was desirable and fits within cost expectations.
Testing Hypothesis 3: Residual Apple Bias
Based on the results of Hypothesis 1, we test whether the LLM has an inherent bias towards apples, separate from the order bias, in a minimal setting. The prompt asks the LLM to pick a fruit at random from the list: Apple, Banana, Orange, Strawberry, Watermelon.
```python
system_prompt_input = '''
You are to pick a fruit at random from the list, Apple, Banana, Orange, Strawberry, Watermelon.
1. Reply with only the name of the fruit, and nothing else.
2. The choice must be totally random.
3. Think carefully about each choice and ignore any order or pattern in the list.
'''
```
Output: {'apple': 0, 'banana': 5, 'orange': 12, 'strawberry': 20, 'watermelon': 13}
The result shows zero apples when the LLM is explicitly asked to be random. This result seems contradictory to Hypothesis 1’s residual apple count and warrants further investigation (likely related to Hypothesis 4).
Testing Hypothesis 4: Order Dominance
We test whether the order of the list affects the output more than any fruit-specific bias. The prompt asks the LLM to pick a fruit at random from the reversed list: Watermelon, Strawberry, Orange, Banana, Apple.
```python
system_prompt_input = '''
You are to pick a fruit at random from the list, Watermelon, Strawberry, Orange, Banana, Apple.
1. Reply with only the name of the fruit, and nothing else.
2. The choice must be totally random.
3. Think carefully about each choice and ignore any order or pattern in the list.
'''
```
Output: {'apple': 31, 'banana': 3, 'orange': 15, 'strawberry': 1, 'watermelon': 0}
This test strongly confirms that positional bias (order) is the dominant factor: Apple, now at the end of the input list, is selected the most, mirroring the pattern from Hypothesis 3. The position in the list, not the fruit itself, drives the selection.
Root cause report
Observed behaviour
The LLM was mentioning Apples, Oranges, and Bananas much more than Watermelons and Strawberries.
Root cause analysis outcome
The LLM’s output is governed by two interacting root causes:
- Context Order Bias: The model exhibits a strong tendency to favour fruits listed earlier in the context.
- Output Restriction: The rule “reply in no more than a sentence” severely limits the output length, reinforcing the context order bias by forcing the model to only mention the first few items it considered.
The idea of testing for seasonal bias is an interesting follow-up experiment, but it is outside the scope of the current RCA on the prompt length/order issue.
Outcome
This final step demonstrates the true value of RCA. The ‘knee-jerk’ fix was a fragile patch that only hid the symptom. The methodical RCA process uncovered the real problem, allowing for a permanent, robust fix (such as randomising the context order for each call, or carefully removing or rewording the conflicting output restriction rule). This is what prevents the same bug from recurring in a different form.
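As a sketch of the first fix, the context blocks can be shuffled on every call so that no fruit consistently occupies a favoured position (the abbreviated blocks and prompt text below are stand-ins for the full prompt in the scenario):

```python
import random

# Abbreviated stand-ins for the full fruit descriptions
fruit_context_blocks = [
    "Apple\nDescription: A crisp and juicy fruit...",
    "Banana\nDescription: A long, curved fruit...",
    "Orange\nDescription: A round citrus fruit...",
    "Strawberry\nDescription: A small, heart-shaped fruit...",
    "Watermelon\nDescription: A very large, heavy fruit...",
]

def build_system_prompt() -> str:
    """Shuffle the context order so positional bias averages out across calls."""
    blocks = fruit_context_blocks[:]
    random.shuffle(blocks)
    return (
        "You are a helpful assistant. Try to sell the user fruit.\n"
        "Below is your context:\n" + "\n".join(blocks)
    )

prompt = build_system_prompt()
print(all(f in prompt for f in
          ["Apple", "Banana", "Orange", "Strawberry", "Watermelon"]))  # prints True
```

Shuffling per request does not remove the model's positional bias; it distributes the bias evenly across fruits, which is often sufficient for this kind of representation requirement.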
Common root causes in LLM applications
From our experience, the most frequent root causes fall into a few categories:
Common root cause categories
| Root cause | How it manifests | Typical fix |
|---|---|---|
| Context order bias | Model favours information listed earlier in the prompt | Randomise context order per request |
| Conflicting instructions | Two rules contradict, model satisfies one at the expense of the other | Audit rules for conflicts, prioritise explicitly |
| Output constraint interference | Length/format rules prevent the model from completing its task | Relax constraints or restructure the task |
| Insufficient context | Model confabulates to fill gaps in provided information | Add explicit 'say I don't know' instructions |
| Temperature mismatch | Too much randomness for factual tasks, or too little for creative ones | Align temperature with task requirements |
Template
Use this structure for your own RCA reports:
- Observed behaviour: [What happened, with specific examples]
- Reproduction steps: [Exact prompt, model, settings]
- Failure categorisation: [Labels applied]
- Pattern identified: [Cluster description]
- Hypotheses tested: [Each hypothesis + result]
- Root cause: [Final determination]
- Recommended fix: [With tradeoffs]
- Lessons learned: [For future reference]
This methodology has been refined through application across multiple client engagements. The template is provided freely for use in your own evaluation processes.