Evaluation Over Testing: How To Test AI

Key Idea: For LLMs, the question is not whether the answers are right. It is whether the answers are better or worse than the last ones.

The morning the suite went red

You create a test suite to check the outputs of an AI-based system that is in development. You do everything right: a good number of tests, diverse test cases, a healthy mix of success and failure cases. You run it once and everything passes. You run it again and 50% of the tests fail. You run it a third time and the pass rate is 70%. Nothing has changed in the code. Nothing has changed in the prompts. The model is the same version. Even the tests are the same.

Anyone who has tried to run traditional pass and fail tests against an LLM has had this experience. It is the point where you start to suspect the problem is not in the system being tested but in the way we think about testing.

The concept

Testing is evolving into evaluation. The shift is from a binary pass and fail system to an analog system of better or worse. This is more a change in mindset than a technical detail. Traditional software testing mitigates risk by measuring passes and fails. Evaluation mitigates risk by looking at the patterns in the data, tracking trends over time, and applying thresholds to identify if we are getting steadily better or steadily worse.

Why we need it

Consider a calculator, which is a deterministic system. The same input produces the same output every time:

Test 1: Input: “2 + 3”, Output=5, check answer is 5, PASS

Test 2: Input: “2 - 3”, Output=-1, check answer is -1, PASS

Test 3: Input: “2 x 3”, Output=6, check answer is 6, PASS

Test 4: Input: “10 / 2”, Output=4, check answer is 5, FAIL

Pass and fail works perfectly here, because each input has exactly one correct output. Now consider the same approach against an LLM:

Prompt: Greet the user

Test 1: Input: “Hello”, Output: “Hello”, check answer is “Hello”, PASS

Test 2: Input: “Hello”, Output: “hello!”, check answer is “Hello”, FAIL

Test 3: Input: “Hello”, Output: “Hi”, check answer is “Hello”, FAIL

Test 4: Input: “Hello”, Output: “Hi there”, check answer is “Hello”, FAIL

In the calculator case, a fail is a real fail. In the LLM case, the three failing answers are all arguably correct. The model has done the right thing. The test has not. As input complexity and response length increase, the number of acceptable variations expands quickly, and binary checks fall apart. Evaluation steps in to ask a different question: how close is the output to what we expected, and is that close enough?
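The failure mode above can be reproduced in a few lines. This is a minimal sketch, assuming the four greeting outputs from the example; the function name exact_match is hypothetical and chosen only for illustration.

```python
def exact_match(output: str, expected: str) -> bool:
    """Binary check: passes only when the two strings are identical."""
    return output == expected


# The four outputs from the greeting example, all checked against "Hello".
outputs = ["Hello", "hello!", "Hi", "Hi there"]
results = [exact_match(output, "Hello") for output in outputs]

# Only the first output passes, although every output is a valid greeting.
pass_rate = sum(results) / len(results)
print(results, pass_rate)
```

The check itself is correct. What it measures, string identity, is simply not the property we care about.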

The same four cases under an evaluation approach:

Test 1: Input: “Hello”, Output: “Hello”, semantic comparison against “Hello”, score = 0.95, threshold 0.9, PASS

Test 2: Input: “Hello”, Output: “hello!”, semantic comparison against “Hello”, score = 0.95, threshold 0.9, PASS

Test 3: Input: “Hello”, Output: “Hi”, semantic comparison against “Hello”, score = 0.93, threshold 0.9, PASS

Test 4: Input: “Hello”, Output: “Hi there”, semantic comparison against “Hello”, score = 0.91, threshold 0.9, PASS
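A scored check with a threshold might be sketched as follows. The scorer is passed in as a function, because in practice it would likely be an embedding similarity or a judge model. The toy_scorer here is a hypothetical stand-in that simply looks up the illustrative scores from the four cases above; it is not a real semantic comparison, and the names evaluate, EvalResult, and TOY_SCORES are assumptions made for this example.

```python
from typing import Callable, NamedTuple


class EvalResult(NamedTuple):
    output: str
    score: float
    passed: bool


def evaluate(output: str, expected: str,
             scorer: Callable[[str, str], float],
             threshold: float = 0.9) -> EvalResult:
    """Score the output against the expectation and apply a threshold."""
    score = scorer(output, expected)
    return EvalResult(output, score, score >= threshold)


# Hypothetical stand-in for a real semantic scorer. The values are the
# illustrative scores from the four cases above, keyed by lowercased text.
TOY_SCORES = {
    ("hello", "hello"): 0.95,
    ("hello!", "hello"): 0.95,
    ("hi", "hello"): 0.93,
    ("hi there", "hello"): 0.91,
}


def toy_scorer(output: str, expected: str) -> float:
    """Look up a canned score; unknown pairs score 0.0."""
    key = (output.strip().lower(), expected.strip().lower())
    return TOY_SCORES.get(key, 0.0)


outputs = ["Hello", "hello!", "Hi", "Hi there"]
results = [evaluate(output, "Hello", toy_scorer) for output in outputs]
for result in results:
    print(result)
```

Keeping the scorer as a parameter means the threshold logic stays the same when the stand-in is swapped for a real similarity measure.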

Better still, the scores are now numbers, and numbers can be plotted. Track them over time and behaviour stops being a snapshot and becomes a trend. Take sentiment, which measures how positive a response is. If sentiment trends downward over weeks, that is the signal worth investigating, whereas a single low sentiment score on a single response means almost nothing on its own.
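One simple way to tell a trend from a one-off dip is to fit a straight line through the scores and look at its slope. The sketch below uses an ordinary least-squares slope; the weekly sentiment values and the alert level of -0.01 per week are invented for illustration and would need tuning against real data.

```python
def slope(values: list[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical weekly average sentiment scores.
steady_decline = [0.82, 0.80, 0.79, 0.76, 0.74, 0.71]
single_dip = [0.80, 0.81, 0.45, 0.80, 0.82, 0.81]

ALERT_SLOPE = -0.01  # assumed alert level: a drop of 0.01 per week

for name, series in [("steady_decline", steady_decline),
                     ("single_dip", single_dip)]:
    s = slope(series)
    print(name, round(s, 4), "investigate" if s < ALERT_SLOPE else "ok")
```

The steady decline is flagged even though no single week looks alarming, while the series with one very low score is not.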

This is what evaluation gives you. Trends instead of snapshots. Comparisons between pre-release and live environments. A way to recognise that LLMs will always produce some non-adherence and some undesired behaviour, and that the question we should be asking is whether that drift is getting worse or staying flat.

This is also where evaluation connects with More is Better. Volume is what makes trends visible. A handful of runs cannot tell you whether you are looking at noise or signal. A thousand runs can.
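A rough calculation shows why volume matters. The sketch below uses the standard normal approximation for the margin of error of an observed pass rate, and it assumes the runs are independent; the function name pass_rate_margin and the 70% pass rate are illustrative choices.

```python
import math


def pass_rate_margin(pass_rate: float, runs: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed pass rate.

    Uses the normal approximation z * sqrt(p * (1 - p) / n) and assumes
    independent runs, which is a simplification.
    """
    return z * math.sqrt(pass_rate * (1 - pass_rate) / runs)


# The same observed 70% pass rate at different volumes.
for runs in (10, 100, 1000):
    margin = pass_rate_margin(0.7, runs)
    print(f"{runs:>5} runs: 70% +/- {margin * 100:.1f} points")
```

At ten runs the uncertainty is wider than most regressions you would want to catch. At a thousand runs it is narrow enough for a shift of a few points to stand out.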

How to apply it

The shift in approach from traditional testing comes down to a few practical changes:

Test outputs become evaluations. The output of a test run is a score, not a pass or fail verdict.
Results are analog, not digital. They need to be monitored with action thresholds and plots so the trends are visible.
Scoring thresholds still validate changes. They are how you catch regressions that drop scores below an acceptable level, even if no individual run has technically failed.
Evaluations stop being a nice-to-have. In traditional systems, they are an extra layer on top of the real testing. In LLM systems, they are the closest representation of true system behaviour, and they should be treated that way.
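The threshold idea in the list above could be sketched as a simple gate that compares a candidate change against a baseline. The function name regression_gate, the floor of 0.9, the allowed drop of 0.02, and the score lists are all assumptions made for this example, not recommended values.

```python
from statistics import mean


def regression_gate(baseline: list[float], candidate: list[float],
                    floor: float = 0.9, max_drop: float = 0.02) -> dict:
    """Compare mean scores of a candidate run against a baseline run.

    The gate fails if the candidate mean is below the floor, or if it has
    dropped by more than max_drop relative to the baseline mean.
    """
    baseline_mean = mean(baseline)
    candidate_mean = mean(candidate)
    below_floor = candidate_mean < floor
    regressed = (baseline_mean - candidate_mean) > max_drop
    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "ok": not (below_floor or regressed),
    }


# Hypothetical scores: every candidate run clears the 0.9 threshold on its
# own, yet the average has slipped noticeably against the baseline.
baseline = [0.95, 0.94, 0.93, 0.96]
candidate = [0.92, 0.91, 0.90, 0.91]

result = regression_gate(baseline, candidate)
print(result)
```

Here no individual run fails, but the gate still flags the change because the average has moved by more than the agreed tolerance.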

Caveats

Evaluation is only as good as the observability behind it. If the patterns of behaviour cannot be captured at the right level and in the right shape, the value collapses. Observability is what makes the trends visible. Without it, the scores are just numbers floating in isolation, and patterns are very difficult to track.

Evaluation is also only as good as the data and the testing strategy that produced it. The wrong evaluation type applied to the wrong scenario is worse than no evaluation at all, because it produces confident numbers that mean nothing. A carefully curated dataset and a deliberate strategy are both non-negotiable.

Final Thoughts

Back to the earlier example. With evaluation in place, the same test suite runs do not produce unpredictable results at all. They produce a chart. The chart shows the scores moving inside their thresholds, occasionally drifting, occasionally spiking, but staying within the band you have agreed represents acceptable behaviour. When something does break, it shows up as a sustained downward trend, not as a single red line on a single morning. The signal is real, and you can act on it.
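The difference between a spike and a sustained break can be checked mechanically. This is one possible sketch, assuming an alert should fire only when several consecutive scores fall below the lower edge of the band; the function name sustained_breach, the window of three, and the sample scores are hypothetical.

```python
def sustained_breach(scores: list[float], lower_bound: float,
                     window: int = 3) -> bool:
    """Return True if `window` consecutive scores fall below lower_bound."""
    consecutive = 0
    for score in scores:
        # Extend the current run of low scores, or reset it.
        consecutive = consecutive + 1 if score < lower_bound else 0
        if consecutive >= window:
            return True
    return False


# Hypothetical daily scores with a lower band edge of 0.9.
one_bad_morning = [0.93, 0.94, 0.85, 0.93, 0.92]
real_break = [0.93, 0.91, 0.89, 0.88, 0.86]

print(sustained_breach(one_bad_morning, 0.9))
print(sustained_breach(real_break, 0.9))
```

A single low score resets as soon as the next run recovers, so only the series that stays below the band raises an alert.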

That is the move from testing to evaluation. The work shifts from chasing individual failures to watching the system behave over time, and trusting the trend instead of the snapshot.
