Confidence Scoring

Scoring how unsure the LLM is based on what you are asking it to do. Confidence is derived from logprobs and provides a direct insight into token-level certainty.

Nov 2025 8 min

Key Idea: Scoring how unsure the LLM is based on what you are asking it to do.

The answer that sounds right

You need to implement a test that measures the factual accuracy of responses. You do all the right things: set up an LLM judge, feed it the same context as the prompt call you are testing, and write the judge prompt specifically for fact-checking. You refine the judge against a benchmark to ensure it’s accurate.

You then add the LLM-based judge in your test pipeline. You ask it “is this response factually correct, true or false?” It replies “true.” You move on. A week later you find a failure that the judge cleared, and you decide to re-run the test because you know it said it was OK before. The judge now says “false.” The judge sounds equally certain every time, because the output is just a word. The actual uncertainty behind that word is invisible to you.

This is the gap that confidence scoring fills. The model already produced a measure of how sure it was. That measure is available through the API response. You just have to ask for it.

What it is

Confidence is a measure of LLM uncertainty. It is derived from the logprobs that some models return via the API, and it scores how sure the LLM was when selecting each output token.

How it works

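For every token the LLM outputs, the model first computes a probability for each possible next token, then selects one based on those probabilities. In simple terms, its internal weights and biases produce a score for every candidate, and those scores are turned into a probability distribution. The highest-probability token is the most likely to be selected, but at a non-zero temperature the token is sampled from the distribution, so a lower-ranked candidate can still be picked.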
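The selection step can be sketched with a toy example. This assumes the token is sampled in proportion to its probability, which is the usual behaviour at a non-zero temperature. The candidate tokens and raw scores are made up for illustration, and `softmax` is a hypothetical helper written here, not part of any SDK. A real model does this over its whole vocabulary.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1 (hypothetical helper)."""
    scaled = [s / temperature for s in scores]
    highest = max(scaled)
    # Subtracting the highest score keeps math.exp from overflowing.
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up candidates and raw scores, for illustration only.
candidates = [" assist", " help", " support"]
scores = [5.0, 0.5, -2.0]

probs = softmax(scores)
sharper_probs = softmax(scores, temperature=0.5)

# Sample one token in proportion to its probability (seeded for repeatability).
rng = random.Random(0)
picked = rng.choices(candidates, weights=probs, k=1)[0]

for token, p in zip(candidates, probs):
    print(f"{token!r}: {p:.4f}")
print("picked:", repr(picked))
```

Lowering the temperature concentrates more probability on the top candidate, which is why `sharper_probs[0]` is larger than `probs[0]`.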

Models use log probabilities, or logprobs, instead of standard 0 to 1 probabilities for numerical stability. Multiplying many small probabilities together quickly underflows towards zero in floating-point arithmetic, while adding their logarithms is fast and stable. Any logprob can be converted back to a regular 0 to 1 probability, in Python by calculating math.exp(logprob).
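A small sketch of both points. The first value is the logprob of the 'Hello' token from the greeting example later in this article. The list of 1,000 probabilities of 0.01 is an illustrative assumption, chosen only to make the underflow visible.

```python
import math

# Logprob of the 'Hello' token from the greeting example in this article.
hello_logprob = -1.1472419600977446e-06
hello_probability = math.exp(hello_logprob)  # back to a 0 to 1 probability

# Illustrative assumption: 1,000 tokens that each have a probability of 0.01.
token_probs = [0.01] * 1000

product = 1.0
for p in token_probs:
    product *= p  # shrinks below what a float can represent and becomes 0.0

# The same information as a sum of logs stays an ordinary float.
log_sum = sum(math.log(p) for p in token_probs)

print("probability of 'Hello':", hello_probability)
print("product of probabilities:", product)
print("sum of logprobs:", log_sum)
```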

Confidence is scored in a normalised form from 0 to 1. Zero is no confidence, one is perfect confidence. In reality you will rarely see exact zeros or exact ones, only values that round to them. The above is brief and high level. The underlying mechanism is the basis of how transformer models work, and is worth reading into properly.

Code example

In the example below, the confidence is calculated for each token output, with the top three candidate tokens captured per position.

from openai import OpenAI
import math

# Placeholder for OpenAI API Key
client = OpenAI(api_key='YOUR-API-KEY-HERE')
top_logprobs_count = 3

def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4.1-mini",
    temperature=1,
    logprobs=None,
    top_logprobs=None,
):
    """Call the chat completions API and return the full completion object."""
    params = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }

    # Return the whole completion, not just the text, so the logprobs stay accessible.
    completion = client.chat.completions.create(**params)
    return completion

# Run 1: Polite Greeting
system_prompt_input = "you are to greet the user"
user_request = "hello"

API_RESPONSE = get_completion(
    [
        {"role": "system", "content": system_prompt_input},
        {"role": "user", "content": user_request}
    ],
    model="gpt-4.1-mini",
    logprobs=True,
    top_logprobs=top_logprobs_count
)

content = API_RESPONSE.choices[0].message.content
print("Output: " + content)

# Each output position carries the top candidates the model considered.
for token_logprob in API_RESPONSE.choices[0].logprobs.content:
    for candidate in token_logprob.top_logprobs:
        # Convert the logprob back to a regular 0 to 1 probability.
        probability = math.exp(candidate.logprob)
        print(f"{candidate} Probability = {probability}")
    print("**********************************")

The output from the script should look something like this:

Output: Hello! How can I assist you today?
TopLogprob(token='Hello', ...) Probability = 0.9999988527586979
TopLogprob(token='Hi', ...) Probability = 1.0677029917933763e-06
TopLogprob(token=' Hello', ...) Probability = 1.2501504819116561e-09
**********************************
TopLogprob(token='!', ...) Probability = 1.0
**********************************
TopLogprob(token=' How', ...) Probability = 1.0
**********************************
TopLogprob(token=' can', ...) Probability = 0.9999783499621913
**********************************
TopLogprob(token=' I', ...) Probability = 1.0
**********************************
TopLogprob(token=' assist', ...) Probability = 0.989011932447037
TopLogprob(token=' help', ...) Probability = 0.010986931105910237
**********************************
TopLogprob(token=' you', ...) Probability = 1.0
**********************************
TopLogprob(token=' today', ...) Probability = 1.0
**********************************
TopLogprob(token='?', ...) Probability = 1.0
**********************************

What this means

Each output token has up to three candidate variants here, with the logprob values converted to confidence scores. The first token, token='Hello', has a logprob of -1.1472419600977446e-06, which translates to a probability of 0.9999988527586979. The model is extremely confident that this is the first token to output.

The interesting position is later in the response:

token=' assist', Probability = 0.989011932447037
token=' help', Probability = 0.010986931105910237
token='assist', Probability = 1.055971596147144e-06

The “ assist” token is the clear winner. But the model was also considering “ help”, which would have produced “Hello! How can I help you today?” The third candidate, “assist” with no leading whitespace, scored vanishingly low because without the space it would run straight onto the previous word, giving “Iassist”.

This is the level of detail confidence scoring exposes. Even on a phrase the model effectively has memorised, you can see the points where it had a real choice to make.
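Finding those decision points can be automated. The sketch below is a minimal example under stated assumptions: `low_confidence_positions` is a hypothetical helper, the 0.99 threshold is arbitrary, and the input is a plain list of (token, logprob) pairs rebuilt from the probabilities printed in the greeting run.

```python
import math


def low_confidence_positions(tokens, threshold=0.99):
    """Return (index, token, probability) for tokens below the threshold.

    `tokens` is a list of (token_text, logprob) pairs. The helper name and
    the default threshold are assumptions made for this sketch.
    """
    flagged = []
    for index, (token, logprob) in enumerate(tokens):
        probability = math.exp(logprob)
        if probability < threshold:
            flagged.append((index, token, probability))
    return flagged


# Probabilities of the selected tokens in the greeting run, as logprobs.
greeting = [
    ("Hello", math.log(0.9999988527586979)),
    ("!", math.log(1.0)),
    (" How", math.log(1.0)),
    (" can", math.log(0.9999783499621913)),
    (" I", math.log(1.0)),
    (" assist", math.log(0.989011932447037)),
    (" you", math.log(1.0)),
    (" today", math.log(1.0)),
    ("?", math.log(1.0)),
]

flagged = low_confidence_positions(greeting)
for index, token, probability in flagged:
    print(f"position {index}: {token!r} at {probability:.4f}")
```

With a live response, the same list can be built with `[(t.token, t.logprob) for t in API_RESPONSE.choices[0].logprobs.content]`.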

Now consider a more abstract scenario. Change the system prompt to this:

```python
system_prompt_input = "Every response must be random words, not related to the user request. use no more than 10 words"

# The rest of the script (the 'get_completion' call and 'for' loop)
# would be run again with this new 'system_prompt_input'
```

Which then gives this output:

```
Output: Pineapple river dancing clouds whisper lanterns silent melody.
TopLogprob(token='P', ...) Probability = 0.18825104578092525
TopLogprob(token='Sun', ...) Probability = 0.14661006186819547
TopLogprob(token='Mar', ...) Probability = 0.07847471115082784
**********************************
TopLogprob(token='ine', ...) Probability = 0.9981165684011469
TopLogprob(token='anc', ...) Probability = 0.000910164310172346
TopLogprob(token='encil', ...) Probability = 0.0004299311776240147
**********************************
TopLogprob(token='apple', ...) Probability = 0.9998400972637154
TopLogprob(token=' apple', ...) Probability = 0.00010889132218537702
TopLogprob(token='cone', ...) Probability = 3.535183642139565e-05
**********************************
TopLogprob(token=',', ...) Probability = 0.13681007749560617
TopLogprob(token=' bicycle', ...) Probability = 0.13681007749560617
TopLogprob(token=' river', ...) Probability = 0.10654780818712543
**********************************
TopLogprob(token=' dance', ...) Probability = 0.21232352365188653
TopLogprob(token=' dancing', ...) Probability = 0.10029454288930743
TopLogprob(token=' bicycle', ...) Probability = 0.06893136404906235
**********************************
TopLogprob(token=' clouds', ...) Probability = 0.3521892898952381
TopLogprob(token=' shadows', ...) Probability = 0.10090392095776979
TopLogprob(token=' swiftly', ...) Probability = 0.10090392095776979
**********************************
TopLogprob(token=' whisper', ...) Probability = 0.5935058959152182
TopLogprob(token=' swiftly', ...) Probability = 0.037941567062417685
TopLogprob(token=' melody', ...) Probability = 0.037941567062417685
**********************************
TopLogprob(token=' silent', ...) Probability = 0.2941753772245046
TopLogprob(token='ing', ...) Probability = 0.13895860884082684
TopLogprob(token=' melody', ...) Probability = 0.07437918347315256
**********************************
TopLogprob(token='s', ...) Probability = 0.6012930655440837
TopLogprob(token=' melody', ...) Probability = 0.06337583039775994
TopLogprob(token=' bicycle', ...) Probability = 0.055928974024749806
**********************************
TopLogprob(token=' bicycle', ...) Probability = 0.1855304796901887
TopLogprob(token=' beneath', ...) Probability = 0.08763840356695249
TopLogprob(token=' melody', ...) Probability = 0.0682528573250704
**********************************
TopLogprob(token=' breeze', ...) Probability = 0.44619380153604793
TopLogprob(token=' melody', ...) Probability = 0.18600126036288722
TopLogprob(token=' ocean', ...) Probability = 0.07753686105762125
**********************************
TopLogprob(token='.', ...) Probability = 0.7282716944205052
TopLogprob(token=' breeze', ...) Probability = 0.11168409486573726
TopLogprob(token=' tul', ...) Probability = 0.024920089973424382
**********************************
```

In the polite-greeting run, almost every token sat at or near 1.0 confidence. The model knew exactly what was coming. Here, the top-choice probabilities are spread across the range. The first token “P” was selected at just 18%, with “Sun” and “Mar” close behind. The model had genuine uncertainty about where to start.

Once “P” was selected, the next two tokens “ine” and “apple” lock in at 99.8% and 99.98% because once the model had committed to “P,” the path to “Pineapple” became almost forced. The first token in a sequence is often where the real uncertainty lives. The tokens that follow are constrained by what came before.

The system prompt is asking the model to do something that goes against its training, forcing it to select tokens where there is no obvious right answer. Some tokens still hold confidence because the output happens to drift into related territory. But the overall pattern is much noisier than the greeting case, and that noise is exactly the signal confidence scoring is built to detect.
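One rough way to put a number on that difference is to average the top-candidate probability at each position in the two runs. The values below are copied from the two outputs in this article. Using the mean as the summary is an assumption of this sketch, and the top candidate is not always the token that was actually sampled.

```python
from statistics import mean

# Top-candidate probability at each position, copied from the greeting run.
greeting_top = [
    0.9999988527586979, 1.0, 1.0, 0.9999783499621913, 1.0,
    0.989011932447037, 1.0, 1.0, 1.0,
]

# Top-candidate probability at each position, copied from the random-words run.
random_words_top = [
    0.18825104578092525, 0.9981165684011469, 0.9998400972637154,
    0.13681007749560617, 0.21232352365188653, 0.3521892898952381,
    0.5935058959152182, 0.2941753772245046, 0.6012930655440837,
    0.1855304796901887, 0.44619380153604793, 0.7282716944205052,
]

greeting_mean = mean(greeting_top)
random_words_mean = mean(random_words_top)

print(f"greeting run:     {greeting_mean:.3f}")
print(f"random-words run: {random_words_mean:.3f}")
```

The greeting run averages close to 1, while the random-words run averages below 0.5.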

How this is useful

LLM confidence can be measured directly from logprobs, and that opens up a few practical uses.

The most common is sanity checking prompts and context. Low confidence across a response can mean the model has a poor grasp of what is being asked, or that the context it has been given does not actually answer the question. That is a useful signal long before the response itself is reviewed.

The second is scoring boolean LLM judges. Asking “is an apple green, true or false” gives you a one-word answer, but the confidence behind that word is what tells you whether to trust it. A judge that returns “true” at 0.99 confidence is a different signal from one that returns “true” at 0.55 confidence, even though the visible output is identical.
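A minimal sketch of scoring a boolean judge, assuming the verdict arrives as a single token and that both "true" and "false" appear among the top candidates for that position. `judge_confidence` is a hypothetical helper, and the logprob values are invented for illustration.

```python
import math


def judge_confidence(top_logprobs):
    """Return (verdict, confidence) from the candidates for the verdict token.

    `top_logprobs` maps candidate token text to its logprob. Variants such as
    " true" or "True" are folded together. Hypothetical helper for this sketch.
    """
    p_true = 0.0
    p_false = 0.0
    for token, logprob in top_logprobs.items():
        label = token.strip().lower()
        if label == "true":
            p_true += math.exp(logprob)
        elif label == "false":
            p_false += math.exp(logprob)

    total = p_true + p_false
    if total == 0.0:
        raise ValueError("neither 'true' nor 'false' is among the candidates")

    verdict = "true" if p_true >= p_false else "false"
    # Renormalise over the two verdicts so the score sits between 0.5 and 1.
    return verdict, max(p_true, p_false) / total


# Invented values: both judges answer "true", with very different certainty.
confident_judge = {"true": math.log(0.99), "false": math.log(0.01)}
borderline_judge = {
    "true": math.log(0.55),
    "True": math.log(0.01),
    "false": math.log(0.44),
}

print(judge_confidence(confident_judge))
print(judge_confidence(borderline_judge))
```

With a live response, the dictionary can be built from the first position: `{c.token: c.logprob for c in API_RESPONSE.choices[0].logprobs.content[0].top_logprobs}`.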

The third is overall response confidence. Aggregating the confidence across every token in a response gives you a single number per response, which is useful for comparison. Response_A_Confidence = 0.95 versus Response_B_Confidence = 0.62 is a comparison you can plot, threshold, and act on.
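The article does not fix an aggregation method, so the sketch below shows one reasonable option as an assumption: the geometric mean of the token probabilities, computed as the exponential of the average logprob. The weakest-token score is included as an alternative. Both function names are hypothetical.

```python
import math


def response_confidence(logprobs):
    """Geometric mean of token probabilities, i.e. exp(mean(logprobs))."""
    if not logprobs:
        raise ValueError("logprobs must not be empty")
    return math.exp(sum(logprobs) / len(logprobs))


def weakest_token_confidence(logprobs):
    """Probability of the least certain token in the response."""
    if not logprobs:
        raise ValueError("logprobs must not be empty")
    return math.exp(min(logprobs))


# Illustrative two-token response: one certain token, one at 25%.
example_logprobs = [math.log(1.0), math.log(0.25)]

overall = response_confidence(example_logprobs)
weakest = weakest_token_confidence(example_logprobs)
print(f"overall: {overall:.2f}, weakest token: {weakest:.2f}")
```

The geometric mean is less sensitive to response length than a plain product of probabilities, which shrinks with every extra token. The weakest-token score is stricter and flags a response when any single token was a coin flip.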

Key points

Pros

Cons

Final Thoughts

Back to the judge that sounded right. With confidence scoring in place, the same call returns “true” plus a number. The “true” answers your question. The number tells you whether to believe it. A judge running at 0.99 over a thousand calls is doing a different job from one running at 0.6, even if both look the same in the response field. The confidence is the part of the answer the model has been giving you all along. It is just a matter of asking for it.