Evaluation Over Testing

It's not 'is it right or wrong?', it's 'is it better or worse?' Testing is evolving into evaluation, shifting from Boolean pass/fail to analog better-or-worse.

Nov 2025 6 min

Key Idea: It’s not “is it right or wrong?”, it’s “is it better or worse?”

Problem

Tests will intermittently fail. How can we trust that the changes we make are successful?

Concept

Testing is evolving into evaluation. The shift is from a Boolean pass/fail system to an analog evaluation system of “better or worse.” This concept is more about a way of thinking than a technical detail. Traditional software development testing revolves around mitigating risk by measuring passes and fails. Evaluations mitigate risk by looking at the patterns of the data, tracking trends, and applying thresholds.

Why we need it

In a traditional approach, evaluating a calculator (a deterministic system):

  • Test 1: Input: “2 + 3”, Output: 5, check answer is 5 - PASS
  • Test 2: Input: “2 - 3”, Output: -1, check answer is -1 - PASS
  • Test 3: Input: “2 x 3”, Output: 6, check answer is 6 - PASS
  • Test 4: Input: “10 / 2”, Output: 4, check answer is 5 - FAIL
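The deterministic cases above can be sketched as plain assertions, assuming a hypothetical `calculate()` helper that evaluates simple two-operand expressions:

```python
# Minimal sketch of the deterministic tests above. calculate() is a
# hypothetical helper, not a real library function.
def calculate(expression: str) -> float:
    left, op, right = expression.split()
    a, b = float(left), float(right)
    results = {"+": a + b, "-": a - b, "x": a * b, "/": a / b}
    return results[op]

# Boolean pass/fail: each case is either exactly right or it fails.
assert calculate("2 + 3") == 5
assert calculate("2 - 3") == -1
assert calculate("2 x 3") == 6
assert calculate("10 / 2") == 5  # a buggy calculator returning 4 would fail here
```

Each assertion is a hard gate: there is no notion of an answer being "close enough."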

In an approach evaluating an LLM text output (a non-deterministic system):

Prompt: Greet the user

  • Test 1: Input: “Hello”, Output: “Hello”, check answer is “Hello” - PASS
  • Test 2: Input: “Hello”, Output: “hello!”, check answer is “Hello” - FAIL
  • Test 3: Input: “Hello”, Output: “Hi”, check answer is “Hello” - FAIL
  • Test 4: Input: “Hello”, Output: “Hi there”, check answer is “Hello” - FAIL

As can be seen, in traditional software testing the pass/fail checks work as expected. In the LLM scenario, by contrast, it could be argued that all of the answers are correct, yet they cannot all be validated with the same simple equality check. Additionally, as input complexity and length increase, the space of acceptable outputs grows exponentially. Therefore, instead of testing for exact matches, evaluation is employed to determine whether an output is “better or worse” than expected.

For example:

  • Test 1: Input: “Hello”, Output: “Hello” → Compare output against “Hello” semantically → score = 0.95 - over threshold 0.9 - PASS
  • Test 2: Input: “Hello”, Output: “hello!” → Compare output against “Hello” semantically → score = 0.95 - over threshold 0.9 - PASS
  • Test 3: Input: “Hello”, Output: “Hi” → Compare output against “Hello” semantically → score = 0.93 - over threshold 0.9 - PASS
  • Test 4: Input: “Hello”, Output: “Hi there” → Compare output against “Hello” semantically → score = 0.91 - over threshold 0.9 - PASS
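The scored checks above could be sketched as follows. Note the similarity function is a stand-in: difflib’s character-level ratio substitutes for a real semantic metric (e.g. cosine similarity over sentence embeddings), so a true paraphrase like “Hi” scores lower here than an embedding model would rate it. The threshold logic is the point of the sketch, not the scores themselves.

```python
from difflib import SequenceMatcher

def semantic_score(output: str, reference: str) -> float:
    # Stand-in similarity: character-level ratio. A real evaluation
    # would use an embedding model to compare meaning, not spelling.
    return SequenceMatcher(None, output.lower(), reference.lower()).ratio()

def evaluate(output: str, reference: str, threshold: float = 0.9) -> tuple[float, bool]:
    # Return the analog score alongside the thresholded pass/fail,
    # so the score can be plotted and trended even when the check passes.
    score = semantic_score(output, reference)
    return score, score >= threshold

score, passed = evaluate("hello!", "Hello")  # near-identical text clears the threshold
```

The key design choice is returning the raw score, not just the Boolean: the threshold gives a release gate, while the score feeds trend tracking.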

Even better, we could plot these scores and track upward or downward trends in behaviour over time. Take, for example, tracking sentiment (how positive the response is): a downward trend in output sentiment would trigger an investigation into why it had decreased.
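One way to sketch this trend tracking: fit a least-squares slope to the score series and alert when it drifts downward. The sentiment numbers below are made up for illustration.

```python
def trend_slope(scores: list[float]) -> float:
    # Least-squares slope of scores over their index (release order).
    n = len(scores)
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

# Hypothetical per-release sentiment scores, purely illustrative.
sentiment = [0.92, 0.90, 0.88, 0.85, 0.81]
if trend_slope(sentiment) < -0.01:
    print("Sentiment trending down - investigate")
```

A real pipeline would do the same with a plotting or monitoring tool; the slope-plus-threshold idea is the essence.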

This makes evaluation more about trends of behaviour. It also allows comparison between live and pre-release environments and fits with the concept of “More is Better.” There will always be some non-adherence or undesired behaviour within the system; “better/worse” recognises the non-deterministic nature of LLMs.

How to apply it

This is a change in approach from traditional software testing.
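A hedged sketch of what this looks like in practice: run a suite of prompts, score each output against a reference, and report aggregates rather than a single Boolean. The `score_fn` parameter is an assumption, meant to be whatever semantic metric your stack provides.

```python
from typing import Callable

def run_eval_suite(
    cases: list[tuple[str, str]],           # (model_output, reference) pairs
    score_fn: Callable[[str, str], float],  # pluggable semantic metric
    threshold: float = 0.9,
) -> dict:
    # Score every case, then aggregate: mean score and pass rate are
    # the trendable numbers; per-case scores support drill-down.
    scores = [score_fn(output, reference) for output, reference in cases]
    passes = sum(score >= threshold for score in scores)
    return {
        "mean_score": sum(scores) / len(scores),
        "pass_rate": passes / len(cases),
        "scores": scores,  # keep per-case scores for trend plotting
    }
```

Running the same suite against a pre-release build and the live system, then comparing `mean_score` and `pass_rate`, gives the “better or worse” comparison the section describes.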

Caveats