Key Idea: Scoring how unsure the LLM is about what you are asking it to do.
What is it
Confidence is a measure of LLM uncertainty. It is derived from the log probabilities (logprobs) that some models return via the API. The confidence is a score of how unsure the LLM was when selecting each output token.
How does it work
This is best illustrated in the outputs below. The LLM selects each output token based on log probabilities: in simple terms, it scores candidate tokens (words or parts of words) using its internal weights and biases, and the highest-probability token is the one that gets output.
Models use log probabilities (logprobs) instead of standard probabilities (0 to 1) for numerical stability. Multiplying many small probabilities together can underflow to zero, while adding their logarithms is fast and stable. Any logprob can be converted back to a regular 0-1 probability by calculating math.exp(logprob).
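As a concrete illustration, with hypothetical logprob values, converting a logprob back to a probability and checking that summing logprobs matches multiplying probabilities looks like this:

```python
import math

# A hypothetical logprob for a single token, as a model might return it
logprob = -1.1472419600977446e-06

# Convert back to a regular 0-1 probability
probability = math.exp(logprob)  # very close to 1.0

# Summing logprobs is numerically equivalent to multiplying probabilities
logprobs = [-0.1, -0.5, -2.0]
product_via_sum = math.exp(sum(logprobs))
product_direct = math.exp(-0.1) * math.exp(-0.5) * math.exp(-2.0)
```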
Confidence is scored in a normalised form from 0 to 1, 0 being no confidence and 1 being perfect confidence. In practice, scores only reach exactly 0 or 1 through rounding. This is a very brief and high-level description; it is worth reading more into this area of machine learning, as it underpins how transformer-based models work.
Code example
In the example below, the confidence is calculated for each output token, and the top 3 alternatives for each token are requested.
from openai import OpenAI
import math

# Placeholder for OpenAI API key
client = OpenAI(api_key='YOUR-API-KEY-HERE')

top_logprobs_count = 3

def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4.1-mini",
    temperature: float = 1,
    logprobs: bool | None = None,
    top_logprobs: int | None = None,
):
    # Returns the full completion object so logprobs can be inspected
    params = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }
    completion = client.chat.completions.create(**params)
    return completion
# Run 1: Polite Greeting
system_prompt_input = "you are to greet the user"
user_request = "hello"

API_RESPONSE = get_completion(
    [
        {"role": "system", "content": system_prompt_input},
        {"role": "user", "content": user_request},
    ],
    model="gpt-4.1-mini",
    logprobs=True,
    top_logprobs=top_logprobs_count,
)

content = API_RESPONSE.choices[0].message.content
print("Output: " + content)

# For each output token, print its top alternatives and their probabilities
for token_logprob in API_RESPONSE.choices[0].logprobs.content:
    for an_entry in token_logprob.top_logprobs:
        probability = math.exp(an_entry.logprob)
        print(str(an_entry) + " Probability = " + str(probability))
    print("**********************************")
The output from the script should look something like this:
Output: Hello! How can I assist you today?
TopLogprob(token='Hello', ...) Probability = 0.9999988527586979
TopLogprob(token='Hi', ...) Probability = 1.0677029917933763e-06
TopLogprob(token=' Hello', ...) Probability = 1.2501504819116561e-09
**********************************
TopLogprob(token='!', ...) Probability = 1.0
**********************************
TopLogprob(token=' How', ...) Probability = 1.0
**********************************
TopLogprob(token=' can', ...) Probability = 0.9999783499621913
**********************************
TopLogprob(token=' I', ...) Probability = 1.0
**********************************
TopLogprob(token=' assist', ...) Probability = 0.989011932447037
TopLogprob(token=' help', ...) Probability = 0.010986931105910237
**********************************
TopLogprob(token=' you', ...) Probability = 1.0
**********************************
TopLogprob(token=' today', ...) Probability = 1.0
**********************************
TopLogprob(token='?', ...) Probability = 1.0
**********************************
What does this mean
As can be seen from the above output, each output token has 3 potential variants, with logprob values converted to confidence scores. The first token, token='Hello', has a logprob of -1.1472419600977446e-06, which converts to a probability of 0.9999988527586979, meaning the LLM is extremely confident that this should be the first token. A more interesting case is the ' assist' token:
token=' assist', Probability = 0.989011932447037
token=' help', Probability = 0.010986931105910237
token='assist', Probability = 1.055971596147144e-06
Here the " assist" token is the clear winner at 98.9%. However, the LLM was also considering " help" (1.1%), which would have produced "Output: Hello! How can I help you today?" The variant "assist" with no leading space was also considered, but scored very low, most likely because without the space it would run into the previous word and read poorly.
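One simple way to act on this information, sketched here over hypothetical (token, logprob) pairs rather than a live API response, is to flag any chosen token whose probability falls below a threshold:

```python
import math

def flag_low_confidence(tokens, threshold=0.99):
    """Return (token, probability) pairs whose probability is below the threshold."""
    flagged = []
    for token, logprob in tokens:
        probability = math.exp(logprob)
        if probability < threshold:
            flagged.append((token, probability))
    return flagged

# Hypothetical chosen tokens, shaped like the output above
chosen_tokens = [
    ("Hello", -1.1472419600977446e-06),
    ("!", 0.0),
    (" assist", math.log(0.989011932447037)),
]

print(flag_low_confidence(chosen_tokens))  # flags only ' assist'
```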
Using a more abstract scenario, the system_prompt_input is changed to this:
system_prompt_input = "Every response must be random words, not related to the user request. use no more than 10 words"
# The rest of the script (the 'get_completion' call and 'for' loop)
# would be run again with this new 'system_prompt_input'
This time the confidence scores vary much more, because the system prompt tells the LLM "Every response must be random words, not related to the user request." This works against the LLM's training, forcing it to select tokens where there is no obvious right answer. Some tokens still carry high confidence because, once a word has started, its continuation is constrained; for example, an output containing "Pineapple" is built from the tokens "P", "ine", and "apple", and the later pieces are predictable from the earlier ones.
How is this useful
This means that LLM confidence can be measured from the logprobs. One application is measuring how confident the LLM is with what is being asked of it in prompts: low confidence can indicate poor understanding of the prompt or context. Another is pairing it with a Boolean output to measure the LLM's confidence on a particular test; for example, ask "Is an apple green, reply true or false" and read off the probability of the "true" token, say 0.99, giving a scoring method usable by LLM-based judges. Lastly, the confidence score can be measured over all the tokens in a response to give an overall confidence, which is useful for comparing outputs, for example: Response_A_Confidence = 0.95, Response_B_Confidence = 0.62.
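The per-response aggregation mentioned above can be sketched as follows, using hypothetical per-token logprobs for two candidate responses. This takes a simple arithmetic mean of the token probabilities; a geometric mean via the average logprob is a common alternative.

```python
import math

def overall_confidence(token_logprobs):
    """Average per-token probability across a whole response."""
    probabilities = [math.exp(lp) for lp in token_logprobs]
    return sum(probabilities) / len(probabilities)

# Hypothetical logprobs for two candidate responses
response_a = [-0.01, -0.02, -0.05]  # consistently confident tokens
response_b = [-0.01, -1.2, -0.9]    # contains very uncertain tokens

print(overall_confidence(response_a))  # high overall confidence
print(overall_confidence(response_b))  # noticeably lower
```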
Key Points
- Confidence is a score of how sure the LLM is in the selection of tokens.
- Confidence is based on logprobs.
- Confidence is a score from 0 to 1: near 0 means very unsure, near 1 means very confident.
- Confidence scoring can be used to judge information with a Boolean output.
- The temperature in the code above is set to 1 because the call and the probabilities are being measured together. However, if the probability score is used in isolation (for example, a true-or-false LLM-based judge call), the temperature should be set low so that the highest-probability token is actually selected.
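For the true-or-false judge case, the score can be read from the top alternatives of the answer token. This is a minimal sketch over hypothetical (token, logprob) pairs; in a live call, the pairs would come from the API response's top_logprobs, with temperature set low as noted above.

```python
import math

def true_confidence(top_logprobs):
    """Return the probability mass placed on the token 'true'."""
    for token, logprob in top_logprobs:
        if token.strip().lower() == "true":
            return math.exp(logprob)
    return 0.0

# Hypothetical alternatives for "Is an apple green, reply true or false."
alternatives = [
    ("true", math.log(0.99)),
    ("false", math.log(0.01)),
]

print(true_confidence(alternatives))  # roughly 0.99
```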
Pros:
- The score is normalised and therefore thresholds and comparison plots can be used.
- This is a very direct insight into how certain the LLM is with the token selections.
Cons:
- Not all models provide logprobs.
- It is an abstract score; it doesn’t necessarily mean anything unless applied to very specific scenarios.