More is Better: How to Test AI

Key Idea: Sending in more requests leads to better evaluation.

The first try fallacy “It worked when I tried it”

A developer writes a prompt, runs it three times, and sees three good responses. The tester tries a few manual tests, and the product manager tries a few more. Everything looks good, so the change ships. A few days later, user issues start to appear. Root cause analysis finds that the same prompt produces something strange in production, and nobody can explain it. The prompt was tested. It passed every time it was tested. It just was not tested enough times to find the failure mode.

This is the first try fallacy. With a non-deterministic system, a few runs are not sufficient to track any type of behaviour change or to ensure the prompt is behaving as designed.

Problem

How do we trust that the outputs of an LLM are consistent when the system is non-deterministic by nature? In traditional software, deterministic outputs are familiar territory. Does it have a tick box? Yes. Pass. Non-deterministic outputs are harder, particularly when the goal is to measure system quality.

Given the user request “Hello,” an LLM might produce a range of responses:

“Hi! How can I assist you today?”

“Hello! How can I help you today?”

“Hi there! How can I assist you today?”

Each of these is a reasonable answer. None of them is the same as the others. A traditional test asking “does the output equal the expected response” cannot give you a useful signal here. A volume of runs can.
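As a minimal sketch of what "a volume of runs" looks like in practice, the snippet below sends the same prompt many times and tallies each distinct response. The `fake_llm` function is a hypothetical stand-in for a real model call (it simply picks one of the three greetings above at random), and `collect_responses` is an illustrative helper name, not part of any library.

```python
import random
from collections import Counter
from typing import Callable

GREETINGS = [
    "Hi! How can I assist you today?",
    "Hello! How can I help you today?",
    "Hi there! How can I assist you today?",
]

def fake_llm(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a real model call: returns a random greeting."""
    return rng.choice(GREETINGS)

def collect_responses(call: Callable[[str], str], prompt: str, runs: int) -> Counter:
    """Send the same prompt `runs` times and count each distinct response."""
    return Counter(call(prompt) for _ in range(runs))

rng = random.Random(0)  # seeded so the sketch is repeatable
tally = collect_responses(lambda p: fake_llm(p, rng), "Hello", runs=100)

# Print the distinct responses, most frequent first.
for text, count in tally.most_common():
    print(f"{count:3d}  {text}")
```

With a real model, the call passed to `collect_responses` would be replaced by the actual client call. The tally is the useful part: it shows the spread of answers rather than a single pass or fail.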

Concept

Sending more data to the LLM creates better evaluations. The “more” can be split into three approaches:

Same request, repeated. Tests whether the LLM responds consistently to the same input. Inconsistency here usually points to the model misreading the prompt or being highly uncertain about its output (high perplexity).
Same intent, different wording. Tests whether the LLM or the surrounding system is over-fitted or over-prompted to the expected phrasing. Checks that it adapts to requests that mean the same thing but are worded, sized, or punctuated differently.
Different intents. Tests whether the LLM stays on topic. If the system is built for a specific purpose, this measures how much it deviates from its instructions when asked something off-piste.
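The three approaches can be sketched as a single labelled test run. This is only an illustration: `EvalCase`, `build_test_run`, and the example prompts are hypothetical names and data chosen for this sketch.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class EvalCase:
    kind: str    # "repeat", "paraphrase", or "off_topic"
    prompt: str

def build_test_run(base_prompt: str, paraphrases: List[str],
                   off_topic: List[str], repeats: int) -> List[EvalCase]:
    """Combine the three kinds of 'more' into one list of labelled cases."""
    # Same request, repeated.
    cases = [EvalCase("repeat", base_prompt) for _ in range(repeats)]
    # Same intent, different wording, size, or punctuation.
    cases += [EvalCase("paraphrase", p) for p in paraphrases]
    # Different intents, to check the system stays on topic.
    cases += [EvalCase("off_topic", p) for p in off_topic]
    return cases

cases = build_test_run(
    base_prompt="Hello",
    paraphrases=["hi", "Hello!!", "hey there, good morning"],
    off_topic=["What is the capital of France?"],
    repeats=5,
)
print(Counter(case.kind for case in cases))
```

Keeping the `kind` label on each case means failures can later be grouped by approach, which helps separate a consistency problem from a phrasing or scope problem.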

Before trying to assess which is the best answer, all the answer permutations need to be collected. With the “Hello” scenario, what if the hundredth answer came back as “I’m not interested in your hello! Go away”? Three runs would almost certainly have missed a failure that rare. A hundred runs give it a real chance to surface.
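To put a rough number on this, assume each call fails independently with the same probability (a simplifying assumption, and the 1% rate below is purely illustrative). The chance of seeing at least one failure in n runs is then 1 - (1 - p)^n. The function name is hypothetical.

```python
def chance_of_seeing_failure(failure_rate: float, runs: int) -> float:
    """Probability that at least one of `runs` independent calls fails."""
    return 1.0 - (1.0 - failure_rate) ** runs

# Compare a handful of run counts for an illustrative 1% failure rate.
for runs in (3, 10, 100, 500):
    print(runs, round(chance_of_seeing_failure(0.01, runs), 3))
```

Under these assumptions, three runs catch a 1% failure about 3% of the time, while a hundred runs catch it about 63% of the time. More volume raises the odds considerably but never makes detection certain.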

As humans we work from internal statistics. “It’s been correct 10 times, so it’ll probably be correct next time.” That heuristic falls apart with LLMs, especially when there are nuances in the prompt and context that are not fully understood. The point of this concept is to replace that heuristic with real measurements. Test for stability and consistency, not just correctness.

How to apply it

In practice this is a compromise between cost and confidence.

Running the same request a million times would give you an extreme degree of confidence in the result. It would also cost a significant amount of time, money, and compute. Nobody runs a million calls for a small prompt change.

The practical approach is to define a “test run” of a fixed size, for example 100 calls per prompt. The size should be based on the risk. High risk areas justify more runs. Low risk areas can manage with fewer. This gives you a statistical sample to work with. If 1 in 100 fails, you have an observed failure rate of 1%, and that is a number you can track, threshold, and compare against the next version.
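An observed rate from 100 calls is still an estimate. One way to express that uncertainty, offered here as a sketch and not as the only option, is a Wilson score interval around the observed failure rate. The function names and the 5% threshold below are assumptions made for illustration.

```python
import math
from typing import Tuple

def wilson_interval(failures: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a failure rate (z=1.96 is roughly 95% confidence)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = failures / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def within_threshold(failures: int, runs: int, max_rate: float) -> bool:
    """Pass only if even the upper bound of the interval is under the threshold."""
    _, upper = wilson_interval(failures, runs)
    return upper <= max_rate

low, high = wilson_interval(failures=1, runs=100)
print(f"observed 1.0%, plausible range {low:.1%} to {high:.1%}")
print(within_threshold(failures=1, runs=100, max_rate=0.05))
```

For 1 failure in 100 runs the interval spans roughly 0.2% to 5.4%, so a strict 5% threshold on the upper bound would not pass yet. This is where the risk-based sizing comes in: a larger test run narrows the interval.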

Caveats

The quality of the data still matters. Poor data, no matter how much of it, produces a poor or misdirected evaluation. Synthetic data can fill in where real data is not available, but it needs to be diverse, and at least some of it needs to be human-reviewed before it is trusted.

The right evaluation methods also need to be applied. Volume only helps if the outputs can be validated objectively and at scale. A thousand runs scored by a flawed metric are no better than three runs. They are just slower.
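As a minimal sketch of objective validation at scale, the check below applies the same rule to every collected response and returns the ones that fail. The rule itself (a greeting should offer help and must not be hostile) and the function names are assumptions for this example. A real system would need checks matched to its own requirements.

```python
from typing import Callable, Iterable, List

def is_acceptable_greeting(response: str) -> bool:
    """Illustrative rule: the reply offers help and contains no hostile phrasing."""
    text = response.lower()
    offers_help = "help" in text or "assist" in text
    hostile = "go away" in text or "not interested" in text
    return offers_help and not hostile

def failing_responses(responses: Iterable[str],
                      check: Callable[[str], bool]) -> List[str]:
    """Return every response that does not satisfy the check."""
    return [r for r in responses if not check(r)]

sample = [
    "Hi! How can I assist you today?",
    "Hello! How can I help you today?",
    "I'm not interested in your hello! Go away",
]
failures = failing_responses(sample, is_acceptable_greeting)
print(f"{len(failures)} of {len(sample)} failed: {failures}")
```

A keyword rule like this is deliberately crude and can itself be a flawed metric, so a sample of its verdicts should be reviewed by a person before the numbers are trusted.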

Closing the loop

Back to the developer who shipped after three runs. With this concept in place, the same change goes through 100 runs, the failure mode appears at run 47, and the prompt gets fixed before it ships. The cost is a few minutes of compute. The saving is a production incident that never happens.

That is what More is Better buys you. Not certainty, just enough volume to see the things three runs will never show.

