Statistical Risk

Use statistics to assess risk. Measure risk statistically by applying a rubric that scores likelihood (probability) and severity (impact).

Nov 2025 7 min

Key Idea: Use statistics to assess risk.

Five bugs and one engineer

Five issues have been identified in production. The team has time to fix two before the next release. The debate starts. One person wants to fix the swearing because the screenshot is on Slack. Another wants to fix the refusals because a customer complained loudly. A third wants to fix the seatbelt advice because it sounds dangerous. Everyone is right about something. Nobody can prove which one matters most.

This is the meeting that statistical risk scoring exists to shorten. The goal is not to make the discussion go away; it is to give the discussion a number to argue against rather than a feeling.

Problem

LLM outputs will always have failures. How do we measure, prioritise, and triage them in an objective, data-driven way? How do we decide what to fix first when everything looks bad in isolation?

Concept

Measure risk statistically by applying a rubric that scores likelihood and severity, then multiplies the two into a single risk score. Below is a simple 5 by 5 risk matrix. The same approach scales to larger grids when finer granularity is needed.

[Figure: 5x5 probability-impact risk matrix]

The matrix on its own does not mean much. The values you assign to each axis are what give it weight.
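As a minimal sketch, the grid itself is nothing more than every likelihood score multiplied by every severity score. The function name `build_risk_matrix`, the 1 to 5 scale on both axes, and the row and column orientation are assumptions for illustration, taken from the 5 by 5 example rather than from any prescribed implementation.

```python
def build_risk_matrix(size: int = 5) -> list[list[int]]:
    """Return a size x size grid where each cell is likelihood * severity.

    Rows are likelihood 1..size and columns are severity 1..size.
    The orientation is an assumption made for this sketch.
    """
    return [
        [likelihood * severity for severity in range(1, size + 1)]
        for likelihood in range(1, size + 1)
    ]


if __name__ == "__main__":
    # Print with the highest likelihood on the top row, as risk matrices
    # are commonly drawn.
    for row in reversed(build_risk_matrix()):
        print(" ".join(f"{cell:2d}" for cell in row))
```

The numbers in the grid are fixed arithmetic. What changes from team to team is the definition behind each 1 to 5 value on the two axes.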

Likelihood

Severity

Why we need it

Consider the five issues below, all reported in production. Which one is the highest risk, and which should be fixed first?

Pick one by instinct. Now apply the rubric and see if the answer is the same.

The order is now clear:

The seatbelt advice wins the priority slot. Not because it sounds the worst on Slack, but because it is both frequent and harmful. The discriminatory output, which most people would name first by instinct, scores the same as a slightly rude refusal because it is rare. That is not the rubric saying it does not matter. It is the rubric saying it can be addressed second, and the seatbelt advice cannot wait.

The same scoring works in reverse, on the proposed fixes. With Issue 1 (score 10), there are two routes to reduce risk:

  1. Reduce occurrence to around 1 in 500. New score: Likelihood (3) x Severity (2) = 6 (Medium).
  2. Reframe the response as “I’d rather not answer that.” New score: Likelihood (5) x Severity (1) = 5 (Medium).

Both are acceptable. The one you pick depends on which is cheaper to implement, but the rubric tells you that either is good enough. That is what makes it useful as a decision tool, not just a prioritisation one.
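The comparison of the two routes above can be sketched as a small script. The `Option` class and the `ACCEPTABLE_MAX` cut-off of 9 are hypothetical: the article only says that scores of 5 and 6 are Medium and acceptable, so the exact boundary is an assumption your team would replace with its own.

```python
from dataclasses import dataclass

# Assumption: a score at or below this value is "good enough" to ship.
# The article does not state the band boundaries; 9 is illustrative only.
ACCEPTABLE_MAX = 9


@dataclass(frozen=True)
class Option:
    name: str
    likelihood: int  # 1..5
    severity: int    # 1..5

    @property
    def score(self) -> int:
        # Risk score is likelihood multiplied by severity.
        return self.likelihood * self.severity


# Issue 1 as reported, and the two proposed routes to reduce its risk.
baseline = Option("Issue 1 as reported", likelihood=5, severity=2)
options = [
    Option("Reduce occurrence to around 1 in 500", likelihood=3, severity=2),
    Option("Reframe the response", likelihood=5, severity=1),
]

# Keep every option whose residual score falls under the assumed cut-off.
acceptable = [o for o in options if o.score <= ACCEPTABLE_MAX]

if __name__ == "__main__":
    print(f"{baseline.name}: {baseline.score}")
    for o in options:
        verdict = "acceptable" if o in acceptable else "not enough"
        print(f"{o.name}: {o.score} ({verdict})")
```

Once both options clear the cut-off, the rubric has nothing more to say, and the choice falls back to implementation cost.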

The rubric can be used for:

How to apply it

The rubric only works when fed with real numbers. Three concepts plug into it directly:

  1. Use More is Better to run thousands of requests. That gives you the hard statistical data needed for the likelihood score, for example “1 failure in 800 requests.”
  2. Use Evaluation over Testing, supported by semantic scoring, language heuristics, or LLM judges, to programmatically score the content of the failures. That gives you the severity score.
  3. Apply this rubric to multiply them: Likelihood (3) x Severity (5) = Risk Score (15).

The result is a single number per issue, derived from real data, comparable across issues, and revisitable when the data changes.
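The three steps above can be sketched as follows. The likelihood bands in `LIKELIHOOD_BANDS` are assumptions chosen so that the article's examples (1 in 500 and 1 in 800) both land on Likelihood 3; the real thresholds are whatever your team agrees. The function names are hypothetical, and the severity value is assumed to come from your own evaluation step.

```python
# Assumed bands: (minimum failure rate, likelihood score), highest first.
# These thresholds are illustrative, not taken from the article.
LIKELIHOOD_BANDS = [
    (1 / 10, 5),
    (1 / 100, 4),
    (1 / 1_000, 3),
    (1 / 10_000, 2),
]


def likelihood_score(failures: int, total_requests: int) -> int:
    """Map an observed failure count to a 1..5 likelihood score."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failures <= total_requests:
        raise ValueError("failures must be between 0 and total_requests")
    rate = failures / total_requests
    for min_rate, score in LIKELIHOOD_BANDS:
        if rate >= min_rate:
            return score
    return 1  # rarer than the lowest band, including zero observed failures


def risk_score(failures: int, total_requests: int, severity: int) -> int:
    """Multiply the measured likelihood by an evaluated 1..5 severity."""
    if not 1 <= severity <= 5:
        raise ValueError("severity must be between 1 and 5")
    return likelihood_score(failures, total_requests) * severity


if __name__ == "__main__":
    # 10 failures in 8,000 requests is 1 in 800; severity 5 from evaluation.
    print(risk_score(failures=10, total_requests=8_000, severity=5))
```

Keeping the bands in one table means a change to the thresholds re-scores every issue consistently instead of one issue at a time.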

Caveats

The rubric needs clear boundaries. The likelihood thresholds and severity definitions both need to be agreed up front, because once scores are being argued, vague definitions are where the argument lives.

The rubric is also a living document. In the example above, Issue 2 (swearing) scored a 9, but a team might feel it deserves higher because of brand sensitivity or a regulated audience. That feeling is the trigger to have a calibration discussion and decide whether swearing should move from Severity 3 to Severity 4. The rubric does not override the team’s judgement. It surfaces where the team’s judgement and the scoring system have drifted apart, and forces an explicit decision about which one to move.
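A calibration change like this is easiest to apply when severity lives in a shared lookup rather than on each issue. The sketch below is illustrative only: the `severity_rubric` dictionary and the category name are hypothetical, and Likelihood 3 for Issue 2 is inferred from its score of 9 at Severity 3.

```python
# Hypothetical shared rubric: failure category -> agreed severity (1..5).
severity_rubric = {"swearing": 3}


def score(category: str, likelihood: int, rubric: dict[str, int]) -> int:
    """Risk score for a category under a given severity rubric."""
    return likelihood * rubric[category]


# Issue 2 before the calibration discussion.
before = score("swearing", likelihood=3, rubric=severity_rubric)

# The team agrees to move swearing from Severity 3 to Severity 4.
recalibrated = {**severity_rubric, "swearing": 4}
after = score("swearing", likelihood=3, rubric=recalibrated)

if __name__ == "__main__":
    print(before, "->", after)  # 9 -> 12
```

At 12, the swearing issue would rank above Issue 1 (score 10), which is exactly the kind of consequence the calibration discussion should make explicit.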

Final Thoughts

Back to the triage meeting. With the rubric in place, the five issues come in already scored. The seatbelt advice is the obvious priority. The discriminatory output is the second one. The refusals get parked. The conversation moves from “which one feels worst” to “do we agree with the scores,” which is a much shorter and more productive discussion.

That is what statistical risk gives you. Not the right answer in every case, but a defensible answer in every case, and a record of how it was reached.