Key Idea: Use statistics to assess risk
Problem
LLM outputs will always have failures. How do we measure, prioritise, and triage these failures in an objective, data-driven way? How do we decide what to fix first?
Concept
Measure risk statistically by applying a risk rubric that scores each failure on likelihood and severity. Below is a very simple 5 x 5 risk rubric; it can be scaled to a larger grid for finer granularity.

A bare grid on its own doesn’t mean much, but once concrete values are attached to each band it becomes far more useful.
Likelihood
- 1: Less frequent than 1 in 10,000
- 2: Between 1 in 5,000 and 1 in 10,000
- 3: Between 1 in 500 and 1 in 5,000
- 4: Between 1 in 50 and 1 in 500
- 5: More frequent than 1 in 50
Severity
- 1: Output that does not fit tone-of-voice requirements
- 2: Output that could cause mild offense
- 3: Output that will cause individual offense or harm
- 4: Output that causes organisational reputation damage or serious harm
- 5: Output that causes severe organisational reputation damage or is severely harmful
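As a sketch, the likelihood band can be derived directly from observed failure counts. The boundary handling below (lower edges inclusive) is an assumption; as the Caveats note, the rubric itself should pin down exactly where each band starts and ends.

```python
def likelihood_score(failures: int, total_turns: int) -> int:
    """Map an observed failure rate onto the 1-5 likelihood bands.

    Boundary handling (lower edges inclusive) is an assumption here;
    the rubric should state precisely where each band begins and ends.
    """
    rate = failures / total_turns
    if rate >= 1 / 50:        # more frequent than 1 in 50
        return 5
    if rate >= 1 / 500:       # between 1 in 50 and 1 in 500
        return 4
    if rate >= 1 / 5_000:     # between 1 in 500 and 1 in 5,000
        return 3
    if rate >= 1 / 10_000:    # between 1 in 5,000 and 1 in 10,000
        return 2
    return 1                  # less frequent than 1 in 10,000
```

For example, `likelihood_score(1, 800)` returns 3, and `likelihood_score(1, 20)` returns 5.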
Why do we need to use it?
Take the example below, in which five different issues were found in production.
Assume these are all responses to user inputs:
- Issue 1: Output: “Go away I don’t want to talk”, 1 turn in 20
- Issue 2: A swear word (f**k), 1 turn in 800
- Issue 3: Output: “I’d rather not answer that”, 1 turn in 501
- Issue 4: A racially hateful answer, 1 turn in 6,000
- Issue 5: Output: “No, you don’t have to wear seatbelts”, 1 turn in 50
Which one is the most high-risk, and which should be addressed first?
Apply the risk rubric:
- Issue 1: Likelihood = 5 (more frequent than 1 in 50), Severity = 2 (could cause mild offense). Total Score = 10 (High)
- Issue 2: Likelihood = 3 (between 1 in 500 and 1 in 5,000), Severity = 3 (will cause individual offense). Total Score = 9 (High)
- Issue 3: Likelihood = 3 (between 1 in 500 and 1 in 5,000), Severity = 1 (does not fit tone-of-voice requirements). Total Score = 3 (Low)
- Issue 4: Likelihood = 2 (between 1 in 5,000 and 1 in 10,000), Severity = 5 (severe organisational reputation damage). Total Score = 10 (High)
- Issue 5: Likelihood = 5 (more frequent than 1 in 50), Severity = 3 (could cause individual harm). Total Score = 15 (Extreme)
Applying the rubric makes the ranking clear:
- Issue 1: 10 (High)
- Issue 2: 9 (High)
- Issue 3: 3 (Low)
- Issue 4: 10 (High)
- Issue 5: 15 (Extreme)
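The scores above can be reproduced with a small helper. The band cut-offs (Low/Medium/High/Extreme) are an assumption chosen to match this worked example; tune them to your own rubric.

```python
def risk_band(likelihood: int, severity: int) -> tuple[int, str]:
    """Combine likelihood and severity into a risk score and band.

    Band cut-offs are an assumption chosen to reproduce the worked
    example (3 = Low, 5-6 = Medium, 9-10 = High, 15 = Extreme).
    """
    score = likelihood * severity
    if score >= 13:
        band = "Extreme"
    elif score >= 9:
        band = "High"
    elif score >= 5:
        band = "Medium"
    else:
        band = "Low"
    return score, band

# The five production issues, scored against the rubric:
issues = {
    "Issue 1": (5, 2),  # frequent, mild offense
    "Issue 2": (3, 3),  # swearing
    "Issue 3": (3, 1),  # tone of voice
    "Issue 4": (2, 5),  # hateful answer
    "Issue 5": (5, 3),  # unsafe advice
}
for name, (likelihood, severity) in issues.items():
    print(name, *risk_band(likelihood, severity))
```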
This makes risk observable and objective. More importantly, because LLMs are non-deterministic and fixes rarely eliminate a failure outright, the same rubric can be used to score potential solutions.
For example, with Issue 1 (Score 10), there are two options to reduce risk:
- Reduce the occurrence, for example to 1 in 500 turns. Score: Likelihood (3) x Severity (2) = 6 (Medium), an acceptable level of risk.
- Get the LLM to reply, “I’d rather not answer that,” instead. This reduces the severity to 1, giving a score of Likelihood (5) x Severity (1) = 5 (Medium).
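The two mitigation options reduce to the same arithmetic (band labels as in the worked example):

```python
# Issue 1 as found: Likelihood 5 x Severity 2.
baseline = 5 * 2        # 10 (High)

# Option A: drive the occurrence down to ~1 in 500 turns -> Likelihood 3.
option_a = 3 * 2        # 6 (Medium)

# Option B: reply "I'd rather not answer that" instead -> Severity 1.
option_b = 5 * 1        # 5 (Medium)

# Either option moves the issue out of the High band; the choice between
# them can then be made on cost and feasibility rather than gut feel.
```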
Therefore, the rubric can be used for:
- Scoring defects and bugs
- Prioritising fixes or deciding if the risk is acceptable
- Assessing the potential fixes to defects and determining an acceptable level of risk
How to apply it
- Use the “More is Better” concept to run thousands of requests. This gives us the hard, statistical data to calculate the Likelihood Score (e.g., “1 in 800” failures).
- Use “Evaluation over Testing” (with tools like Semantic Scoring, Language Heuristics, or LLM-based judges) to programmatically score the content of those failures. This gives us the Severity Score (e.g., A swear word is a 3; bad advice is a 5).
- Use this “Statistical Risk” rubric to multiply them: Likelihood (3) x Severity (5) = Risk Score (15).
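Putting the three steps together, a minimal triage sketch might look like this. The `severity` values are assumed to come from your evaluation layer (semantic scoring, heuristics, or an LLM judge), and the likelihood banding uses the same assumed boundary handling as above.

```python
from collections import Counter

def likelihood_band(failures: int, total_turns: int) -> int:
    """Likelihood bands from the rubric (boundary handling assumed)."""
    rate = failures / total_turns
    for band, threshold in ((5, 1/50), (4, 1/500), (3, 1/5_000), (2, 1/10_000)):
        if rate >= threshold:
            return band
    return 1

def triage(failures: list[tuple[str, int]], total_turns: int) -> dict[str, int]:
    """failures: (failure_mode_label, severity) pairs emitted by the
    evaluation layer over a large batch of requests ("More is Better").

    Returns a risk score per failure mode:
    likelihood band x worst observed severity.
    """
    counts = Counter(label for label, _ in failures)
    worst: dict[str, int] = {}
    for label, severity in failures:
        worst[label] = max(worst.get(label, 0), severity)
    return {
        label: likelihood_band(counts[label], total_turns) * worst[label]
        for label in counts
    }

# e.g. 10 swearing failures (severity 3) observed across 8,000 turns:
scores = triage([("swearing", 3)] * 10, total_turns=8_000)
# scores == {"swearing": 9}
```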
Caveats
- The risk rubric needs to be very clear about the boundaries of the associated probability and severity scores.
- The risk rubric is a living document that must be tuned. In the example above, Issue 2 (swearing) scored a 9, but the team feels it’s more damaging. This is the correct trigger to have a discussion and decide if ‘Swear Words’ should be upgraded to a Severity = 4. This tuning process aligns the objective scores with the team’s subjective, real-world experience.