More is Better

Sending in more requests leads to better evaluation. Testing for stability and consistency, not just correctness, is fundamental to evaluating non-deterministic systems.

Nov 2025 5 min

Key Idea: Sending in more requests leads to better evaluation.

The first try fallacy “It worked when I tried it”

A developer writes a prompt, runs it three times, sees three good responses, the tester tries a few manual tests, the product manageer tries a few more. Everything looks good, the change ships.. A few days later user issues start to appear in the system after some root cause analysis it’s found that the same prompt produces something strange in production, and nobody can explain it. The prompt was tested. It passed every time it was tested. It just was not tested enough times to find the failure mode.

This is the first try fallacy. With a non-deterministic system, a few runs is not sufficient to track any type of behaviour change or ensure the prompt is behaving as designed.

Problem

How do we trust the outputs of an LLM are consistent when the system is non-deterministic by nature? In traditional software, deterministic outputs are familiar territory. Does it have a tick box? Yes. Pass. Non-deterministic outputs are harder, particularly when the goal is to measure system quality.

Given the user request “Hello,” an LLM might produce a range of responses:

  • “Hi! How can I assist you today?”
  • “Hello! How can I help you today?”
  • “Hi there! How can I assist you today?”

Each of these is a reasonable answer. None of them is the same as the others. A traditional test asking “does the output equal the expected response” cannot give you a useful signal here. A volume of runs can.

Concept

Sending more data to the LLM creates better evaluations. The “more” can be split into three approaches:

Before trying to assess which is the best answer, all the answer permutations need to be collected. With the “Hello” scenario, what if the hundredth answer came back as “I’m not interested in your hello!, go away”? Three runs would never have surfaced that. A hundred would.

As humans we work from internal statistics. “It’s been correct 10 times, so it’ll probably be correct next time.” That heuristic falls apart with LLMs, especially when there are nuances in the prompt and context that are not fully understood. The point of this concept is to replace that heuristic with a real one. Test for stability and consistency, not just correctness.

How to apply it

In practice this is a compromise between cost and confidence.

Running the same request a million times would give you an extreme degree of confidence in the result. It would also cost a significant amount of time, money, and compute. Nobody runs a million calls for a small prompt change.

The practical approach is to define a “test run” of a fixed size, for example 100 calls per prompt. The size should be based on the risk. High risk areas justify more runs. Low risk areas can manage with fewer. This gives you a statistical sample to work with. If 1 in 100 fails, you have a 1% failure rate, and that is a number you can track, threshold, and compare against the next version.

Caveats

The quality of the data still matters. Poor data, no matter how much of it, produces a poor or misdirected evaluation. Synthetic data can fill in where real data is not available, but it needs to be diverse, and at least some of it needs to be human-reviewed before it is trusted.

The right evaluation methods also need to be applied. Volume only helps if the outputs can be validated objectively and at scale. A thousand runs scored by a flawed metric is not better than three runs. It is just slower.

Closing the loop

Back to the developer who shipped after three runs. With this concept in place, the same change goes through 100 runs, the failure mode appears at run 47, and the prompt gets fixed before it ships. The cost is a few minutes of compute. The saving is a production incident that never happens.

That is what More is Better buys you. Not certainty, just enough volume to see the things three runs will never show.