Key Idea: It’s not “is it right or wrong?”, it’s “is it better or worse?”
Problem
Tests will intermittently fail. How can we trust the changes we make are successful?
Concept
Testing is evolving into evaluation. The shift is from a Boolean pass/fail system to an analog evaluation system of “better or worse.” This concept is more about a way of thinking than a technical detail. Traditional software development testing revolves around mitigating risk by measuring passes and fails. Evaluations mitigate risk by looking at the patterns of the data, tracking trends, and applying thresholds.
Why do we need to use it
Compare the two approaches below. First, a traditional approach testing a calculator (a deterministic system):
- Test 1: Input: “2 + 3”, Output=5, check answer is 5 - PASS
- Test 2: Input: “2 - 3”, Output=-1, check answer is -1 - PASS
- Test 3: Input: “2 x 3”, Output=6, check answer is 6 - PASS
- Test 4: Input: “10 / 2”, Output=4, check answer is 5 - FAIL
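The calculator checks above map directly onto plain assertions. This is a minimal sketch; the `calculate` helper is hypothetical, written here only to make the example self-contained:

```python
def calculate(expression: str) -> float:
    # Hypothetical calculator under test: parses "a op b" expressions.
    left, op, right = expression.split()
    a, b = float(left), float(right)
    return {"+": a + b, "-": a - b, "x": a * b, "/": a / b}[op]

# Boolean pass/fail: each assertion either holds or raises.
assert calculate("2 + 3") == 5
assert calculate("2 - 3") == -1
assert calculate("2 x 3") == 6
assert calculate("10 / 2") == 5  # a buggy implementation returning 4 FAILs here
```

Each check is binary: the output either equals the expected value or it does not, which is exactly the right model for a deterministic system.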
In an approach evaluating an LLM text output (a non-deterministic system):
Prompt: Greet the user
- Test 1: Input: “Hello”, Output: “Hello”, check answer is “Hello”, PASS
- Test 2: Input: “Hello”, Output: “hello!”, check answer is “Hello”, FAIL
- Test 3: Input: “Hello”, Output: “Hi”, check answer is “Hello”, FAIL
- Test 4: Input: “Hello”, Output: “Hi there”, check answer is “Hello”, FAIL
In traditional software testing, pass/fail works as expected: Test 4 correctly catches a genuine bug. In the LLM scenario, however, all four answers are arguably correct, yet three fail because a simple exact-match check cannot validate them. As input complexity and length increase, the space of acceptable outputs grows rapidly. Therefore, instead of testing for exact matches, evaluation is employed to determine whether an output is “better or worse” than expected.
For example:
- Test 1: Input: “Hello”, Output: “Hello” → Compare output against “Hello” semantically → score = 0.95 - over threshold 0.9 - PASS
- Test 2: Input: “Hello”, Output: “hello!” → Compare output against “Hello” semantically → score = 0.95 - over threshold 0.9 - PASS
- Test 3: Input: “Hello”, Output: “Hi” → Compare output against “Hello” semantically → score = 0.93 - over threshold 0.9 - PASS
- Test 4: Input: “Hello”, Output: “Hi there” → Compare output against “Hello” semantically → score = 0.91 - over threshold 0.9 - PASS
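The scored checks above can be sketched as a score-then-threshold function. This is a minimal illustration, not a production evaluator: it uses `difflib` as a crude lexical stand-in for a real embedding-based semantic similarity model (an embedding model would be needed for “Hi there” vs “Hello” to score highly, as in Test 4), and the 0.9 threshold is taken from the example:

```python
from difflib import SequenceMatcher

THRESHOLD = 0.9  # acceptance threshold from the example above

def score(output: str, reference: str) -> float:
    # Crude lexical similarity as a placeholder; a real pipeline would
    # embed both strings and compare them semantically.
    return SequenceMatcher(None, output.lower(), reference.lower()).ratio()

def evaluate(output: str, reference: str) -> bool:
    # Analog score reduced to a release decision via a threshold.
    return score(output, reference) >= THRESHOLD

assert evaluate("Hello", "Hello")    # exact match scores 1.0
assert evaluate("hello!", "Hello")   # near match clears the threshold
```

The key shape is the same regardless of the scoring model: an analog score in [0, 1], with a threshold turning it into an actionable decision.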
Even better, we could plot these scores and track upward or downward trends in behaviour over time. Take, for example, tracking sentiment (how positive the response is): a downward trend in output sentiment would trigger an investigation into why it had decreased over time.
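A downward trend can be detected mechanically rather than by eyeballing a chart. The sketch below fits a least-squares slope to a series of sentiment scores; the scores and the alert threshold are illustrative, not from the source:

```python
def trend_slope(scores: list[float]) -> float:
    # Least-squares slope of score vs. run index; negative means downward.
    n = len(scores)
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

# Illustrative nightly sentiment scores, drifting downward.
sentiment = [0.91, 0.90, 0.88, 0.85, 0.82]

if trend_slope(sentiment) < -0.01:  # hypothetical alert threshold
    print("downward sentiment trend - investigate")
```

The same pattern applies to any tracked metric: a slope (or moving-average comparison) plus an alert threshold turns a plot into an automated signal.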
This reframes evaluation around trends of behaviour. It also allows comparison between live environments and pre-release environments and fits in with the concept of “More is Better.” There will always be some non-adherence or undesired behaviour within the system; “better/worse” recognises the non-deterministic nature of LLMs.
How to apply it
This is a change in approach from traditional software testing:
- Test outputs will be considered evaluations rather than pass/fail.
- Results will often be analog rather than digital, meaning they need to be monitored with action thresholds and plots for visibility.
- Scoring thresholds can still be used to validate changes and ensure regressions have not caused scores to drop below an acceptable threshold.
- Often, evaluations are seen as “nice-to-haves” in traditional systems, but in this case, they are the closest representation of true system behaviour and should be taken seriously.
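Applying thresholds to validate a change can be sketched as a simple release gate. All names and thresholds below are hypothetical assumptions, shown only to illustrate combining an absolute floor with a regression check against a baseline:

```python
def gate(baseline: list[float], candidate: list[float],
         floor: float = 0.9, max_drop: float = 0.02) -> bool:
    # Hypothetical release gate: the candidate build's mean score must stay
    # above an absolute floor AND must not regress more than max_drop
    # relative to the baseline (e.g. the current production build).
    base_mean = sum(baseline) / len(baseline)
    cand_mean = sum(candidate) / len(candidate)
    return cand_mean >= floor and (base_mean - cand_mean) <= max_drop

assert gate([0.95, 0.94], [0.94, 0.93])      # small drop, above floor: ship
assert not gate([0.95, 0.94], [0.90, 0.88])  # below floor: block the release
```

Run in CI, a gate like this preserves the familiar go/no-go decision while the underlying signal stays analog.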
Caveats
- Evaluations are only as good as the observability applied to them. If the patterns of behaviour cannot be observed at the correct level and using the correct representations, a lot of the value is lost.
- Evaluations are only as good as the data and the testing strategy applied to generating them. They need a lot of careful consideration to get the correct type of evaluation for the correct purpose and a carefully curated data set.