Key Idea: Sending more requests leads to better evaluation.
Problem
How can we trust that an LLM's outputs are consistent when it is non-deterministic by nature? A fundamental difference in testing AI systems is the non-deterministic nature of their output. Traditional software testing deals with familiar, deterministic checks, for example: "does it have a tick box? Yes, PASS." Non-deterministic results are more difficult to assess, particularly when attempting to measure system quality.
For example: Given the user request “Hello,” an LLM might produce varied responses such as:
- “Hi! How can I assist you today?”
- “Hello! How can I help you today?”
- “Hi there! How can I assist you today?”
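This is why a traditional exact-match assertion breaks down. A minimal sketch (the keyword check below is a hypothetical illustration; real systems might use an LLM judge or embedding similarity):

```python
# Sketch: why exact-match assertions break on non-deterministic output.
# All three responses are equivalent greetings, yet only one matches a
# fixed expected string.
responses = [
    "Hi! How can I assist you today?",
    "Hello! How can I help you today?",
    "Hi there! How can I assist you today?",
]

expected = "Hello! How can I help you today?"

# Traditional deterministic check: passes for exactly one response.
exact_matches = [r == expected for r in responses]
print(exact_matches)  # [False, True, False]

# A looser, intent-level check (keyword-based here purely for
# illustration).
def is_greeting_with_offer(text: str) -> bool:
    text = text.lower()
    return any(g in text for g in ("hi", "hello")) and (
        "assist" in text or "help" in text
    )

print(all(is_greeting_with_offer(r) for r in responses))  # True
```

The intent-level check accepts all three phrasings, which is the behaviour you actually want to verify.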
Concept
Sending more requests to the LLM produces a better evaluation. This can be split into three parts:
- Sending the same request repeatedly.
- Sending many different requests with the same intent but different wording.
- Sending many different requests with different intents.
Each of these has benefits:
- Sending the same request: Tests that the LLM responds consistently to the same request; inconsistency is likely due to miscomprehension or high perplexity.
- Same intent, different wording: Tests that the LLM or surrounding system is not over-fitted or overly prompted to the intended scenarios, checking whether it can adapt to requests that are similar but differ in wording, length, and grammar.
- Different intents: Tests that the LLM can stay on topic. If the LLM serves a specific purpose, this is useful for measuring prompt adherence, for example: making sure the LLM does not deviate from its instructions.
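The three categories above can be organised into one test run. A minimal sketch; the scenario, phrasings, and helper are hypothetical:

```python
# Sketch of a test run covering the three request categories.
# The example phrasings and build_test_run helper are illustrative.

# 1. Same request, repeated: checks consistency.
same_request = ["Hello"] * 20

# 2. Same intent, different wording: checks robustness to phrasing.
same_intent = ["Hello", "Hi there", "Hey, good morning!", "greetings"]

# 3. Different intents: checks the system stays on topic.
off_topic = ["Write me a poem about cats", "What is 2 + 2?"]

def build_test_run(cases, repeats=1):
    """Expand each case into the number of calls it should receive."""
    return [case for case in cases for _ in range(repeats)]

run = build_test_run(same_intent, repeats=5)
print(len(run))  # 20 calls: 4 phrasings x 5 repeats each
```

Each request in `run` would then be sent to the system under test and its response collected for evaluation.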
Before trying to assess which is the best answer, all the answer permutations need to be collected. For example, with the “Hello” scenario, what if the 100th answer replied, “I’m not interested in your hello!, go away”?
As humans, we work with our internal statistics. For example, "It's been correct 10 times; that means it will most likely be correct every time." However, this cannot be assured, especially if there are nuances in the prompting and context data that are not fully understood.
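That intuition can be quantified. A standard statistical result, the "rule of three", says that after n trials with zero observed failures, a 95% upper bound on the true failure rate is roughly 3/n:

```python
# "Correct 10 times" gives weaker assurance than intuition suggests.
# With n trials and zero observed failures, an approximate 95% upper
# bound on the true failure rate is 3/n (the "rule of three").
def rule_of_three_upper_bound(n: int) -> float:
    return 3.0 / n

print(f"{rule_of_three_upper_bound(10):.0%}")    # 30%
print(f"{rule_of_three_upper_bound(100):.0%}")   # 3%
print(f"{rule_of_three_upper_bound(1000):.1%}")  # 0.3%
```

So ten clean passes are still consistent with a true failure rate as high as 30%; driving that bound down is exactly why larger test runs matter.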
The core of this concept, therefore, is that you must test for stability and consistency, not just correctness.
How to apply it
The application of this concept is a compromise of Cost vs. Confidence.
- Running the same request 1,000,000 times would give an extreme degree of confidence, but it would also cost significant time and compute.
- You don’t run a test 1,000,000 times for every simple change. You define a “test run” (e.g., 100 calls per prompt). The size should be based on risk: higher risk warrants more runs, lower risk fewer.
- This gives you a statistical sample. If 1 call in 100 fails, you have an observed 1% failure rate.
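An observed 1% failure rate from 100 calls still carries real uncertainty. A sketch using the Wilson score interval (one standard way to bound a proportion from a small sample):

```python
import math

# Sketch: report a test run's failure rate with uncertainty bounds
# (95% Wilson score interval) rather than as a bare percentage.
def wilson_interval(failures: int, n: int, z: float = 1.96):
    """Return (low, high) 95% bounds on the true failure rate."""
    p = failures / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return max(0.0, centre - half), min(1.0, centre + half)

low, high = wilson_interval(failures=1, n=100)
print(f"observed 1%, 95% CI: {low:.1%} to {high:.1%}")
```

For 1 failure in 100 calls, the true failure rate could plausibly be anywhere from well under 1% up to roughly 5%, which is why the run size should scale with the risk of the change.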
Caveats
The quality of the data does still matter. Poor data, no matter how much of it, will result in a poor or misdirected evaluation. Synthetic data can be used if real data is not available; however, it does need to be diverse and human-reviewed, at least in part.
The correct test methods need to be applied to ensure the outputs can be validated objectively and in a scalable manner.