More is Better

Sending in more requests leads to better evaluation. Testing for stability and consistency, not just correctness, is fundamental to evaluating non-deterministic systems.

Nov 2025 5 min

Key Idea: Sending in more requests leads to better evaluation.

Problem

How can we trust the outputs of an LLM when the model is non-deterministic by nature? A fundamental difference in testing AI systems is the non-deterministic nature of their output. Traditional software testing deals with deterministic outputs, for example: “does the page have a tick box? Yes → PASS.” Non-deterministic results are much harder to judge, particularly when attempting to measure system quality.

For example: Given the user request “Hello,” an LLM might produce varied responses such as:

  • “Hi! How can I assist you today?”
  • “Hello! How can I help you today?”
  • “Hi there! How can I assist you today?”

Concept

Sending more requests to the LLM produces a better evaluation: more samples mean more of the answer space is observed, and rare outliers have a chance to surface.

Before trying to assess which answer is best, all the answer permutations need to be collected. For example, in the “Hello” scenario, what if the 100th answer replied, “I’m not interested in your hello, go away”?

As humans, we run on internal statistics: “It’s been correct 10 times, so it will most likely be correct every time.” However, this cannot be assured, especially if there are nuances in the prompt and context data that are not fully understood.
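That intuition can be quantified. A standard exact binomial bound (the basis of the statisticians’ “rule of three”, included here as supporting math rather than anything from the article) gives the highest true failure rate still consistent with a run of consecutive passes:

```python
def failure_rate_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the true failure rate after n_trials
    consecutive passes with zero observed failures.

    Derivation: if the true failure rate is p, the chance of seeing
    n passes in a row is (1 - p)**n; solve (1 - p)**n = 1 - confidence.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)

print(failure_rate_upper_bound(10))   # ~0.259: 10 straight passes still permit a ~26% failure rate
print(failure_rate_upper_bound(100))  # ~0.030: 100 passes narrow it to ~3%
```

Ten consecutive passes sound reassuring, yet they are statistically consistent with a system that fails roughly a quarter of the time. Pushing the bound down is exactly what sending more requests buys you.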

The core of this concept, therefore, is that you must test for stability and consistency, not just correctness.

How to apply it

Applying this concept is a trade-off between cost and confidence: every extra request costs time and tokens, but each one also narrows the uncertainty around the system’s true behaviour.
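As a rough guide to that trade-off, the standard worst-case sample-size formula for estimating a proportion (general statistics, not something prescribed by the article) shows how quickly request counts grow as the acceptable margin of error shrinks:

```python
import math

def samples_needed(margin: float, z: float = 1.96) -> int:
    """Requests needed to estimate a pass rate to within +/- margin
    at ~95% confidence (worst case p = 0.5, normal approximation)."""
    return math.ceil((z ** 2) * 0.25 / margin ** 2)

for margin in (0.10, 0.05, 0.01):
    print(f"+/-{margin:.0%} margin: {samples_needed(margin)} requests")
```

Halving the margin roughly quadruples the request count (about 97 requests for ±10%, 385 for ±5%, 9,604 for ±1%), which is why the budget, not the statistics, usually decides where evaluation stops.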

Caveats

The quality of the data still matters. Poor data, no matter how much of it, results in a poor or misdirected evaluation. Synthetic data can be used if real data is not available; however, it needs to be diverse and human-reviewed, at least in part.

The correct test methods need to be applied to ensure the outputs can be validated objectively and in a scalable manner.