Statistical Risk

Key Idea: Use statistics to assess risk. Measure risk statistically by applying a risk rubric with scoring for probability and impact.

Nov 2025 7 min

Summary

LLM

Problem

LLM outputs will always include some failures. How do we measure, prioritise, and triage these failures in an objective, data-driven way? How do we decide what to fix first?

Concept

Measure risk statistically by applying a risk rubric that scores each failure for probability (likelihood) and impact (severity), then multiplies the two. Below is a very simple 5 by 5 risk rubric. It can be scaled up to a larger grid for finer granularity.

5x5 probability impact risk matrix
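
As a rough illustration of what such a grid contains, the sketch below builds a 5 by 5 matrix in which each cell is simply likelihood multiplied by impact. The function name `risk_matrix` and the row ordering (highest likelihood first) are assumptions made for this example, not part of the original rubric.

```python
# Minimal sketch: a 5x5 grid where each cell is likelihood x impact.
# The layout (highest likelihood on the top row) is an assumption.
SCALE = range(1, 6)  # scores 1..5 on both axes


def risk_matrix(scale=SCALE):
    """Return rows ordered from highest likelihood to lowest."""
    return [
        [likelihood * impact for impact in scale]
        for likelihood in reversed(scale)
    ]


matrix = risk_matrix()

# Print one row per likelihood level, columns are impact 1..5.
for likelihood, row in zip(reversed(SCALE), matrix):
    print(f"L{likelihood} | " + " ".join(f"{cell:2d}" for cell in row))
```

Scaling up to a finer grid only requires passing a larger range, for example `risk_matrix(range(1, 11))`.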

On its own the matrix doesn’t mean much, but once values are assigned to each likelihood and severity level, it starts to make more sense and become more useful.

Likelihood

Severity

Why do we need to use it?

Take, for example, the test below. In the example, five different issues were found in production. Which would have the greatest associated risk and therefore should be prioritised first?

Assume these are all responses to user inputs:

Which one carries the highest risk, and which should be addressed first?

Apply the risk rubric:

As can be seen, the risk is very clear:

This allows the risk to be observed objectively. More importantly, because LLMs are non-deterministic, it allows us to score potential solutions against the same rubric.

For example, with Issue 1 (Score 10), there are 2 options to reduce risk:

  1. Reduce how often it occurs, for example to 1 in 500 responses, which would be an acceptable level of risk. Score: Likelihood (3) x Impact (2) = 6 (Medium).
  2. Get the LLM to reply, “I’d rather not answer that,” instead, which reduces the impact to 1. Score: Likelihood (5) x Impact (1) = 5 (Medium).
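
The arithmetic behind the two options can be sketched in a few lines. The `Option` class and its field names are hypothetical, introduced only for this illustration; the scores are the ones given in the text for Issue 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """A hypothetical record for one mitigation option."""
    name: str
    likelihood: int  # 1-5 rubric score
    impact: int      # 1-5 rubric score

    @property
    def score(self) -> int:
        # Risk score is likelihood multiplied by impact.
        return self.likelihood * self.impact


# Issue 1 as found: Likelihood (5) x Impact (2) = 10.
baseline = Option("Issue 1 as found", likelihood=5, impact=2)

options = [
    Option("Reduce occurrence to 1 in 500", likelihood=3, impact=2),
    Option("Reply with a refusal instead", likelihood=5, impact=1),
]

# Rank the options by residual risk, lowest first.
ranked = sorted(options, key=lambda o: o.score)
for option in ranked:
    print(f"{option.name}: {option.score} (down from {baseline.score})")
```

Both options lower the score from 10, so the choice between them can be made on cost and effort rather than on instinct.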

Therefore, the rubric can be used for:

How to apply it

  1. Use the “More is Better” concept to run thousands of requests. This gives us the hard, statistical data to calculate the Likelihood Score (e.g., “1 in 800” failures).
  2. Use “Evaluation over Testing” (with tools like Semantic Scoring, Language Heuristics, or LLM-based judges) to programmatically score the content of those failures. This gives us the Severity Score (e.g., a swear word is a 3; bad advice is a 5).
  3. Use this “Statistical Risk” rubric to multiply them: Likelihood (3) x Severity (5) = Risk Score (15).
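
The three steps above can be sketched as a small calculation. The likelihood bands in `LIKELIHOOD_BANDS` are assumptions invented for this example, chosen so that a rate of 1 in 500 or 1 in 800 maps to a score of 3 as in the text; the function names are also hypothetical. Substitute the thresholds from your own rubric.

```python
# ASSUMPTION: illustrative likelihood bands, not the article's own table.
# Each entry is (minimum failure rate, likelihood score).
LIKELIHOOD_BANDS = [
    (1 / 10, 5),
    (1 / 100, 4),
    (1 / 1_000, 3),
    (1 / 10_000, 2),
]


def likelihood_score(failures: int, requests: int) -> int:
    """Map an observed failure rate onto a 1-5 likelihood score."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    rate = failures / requests
    for min_rate, score in LIKELIHOOD_BANDS:
        if rate >= min_rate:
            return score
    return 1  # rarer than every band


def risk_score(failures: int, requests: int, severity: int) -> int:
    """Multiply the likelihood score by a 1-5 severity score."""
    if not 1 <= severity <= 5:
        raise ValueError("severity must be between 1 and 5")
    return likelihood_score(failures, requests) * severity


# 10 failures in 8,000 requests is a rate of 1 in 800.
# Severity 5 stands in for the "bad advice" example.
score = risk_score(failures=10, requests=8_000, severity=5)
print(score)
```

In practice the severity value would come from the evaluation step (for example an LLM-based judge) rather than being passed in by hand.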

Caveats