Semantic Measurements

Key Idea: Scoring the meaning of natural language strings in a measurable way.

Oct 2025 · 10 min

What is it

Semantic measurements quantify meaning: not just which words match, as in ROUGE and BLEU, but what those words mean individually and within the context of the sentence, paragraph, or corpus that contains them.

This is useful when testing AI, as it gives a numerical value for how close in meaning a non-deterministic response is to a known good answer.

How does it work

Semantic measurements work by calculating the cosine of the angle between two vectors in an embedding space. In the example below, the vectors come from OpenAI's text-embedding-3-small model, which returns 1536-dimensional vectors. In plain terms: the two inputs are each embedded into a vector, and the angle between those vectors is measured to see whether they point in a similar direction. A similar direction means closer in meaning.

The score ranges from -1 to 1:

- 1: the vectors point in the same direction (near-identical meaning)
- 0: the vectors are orthogonal (unrelated meanings)
- -1: the vectors point in opposite directions

Note: Most comparisons will score between 0 and 1, as opposite vectors are less common than unrelated ones.
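These boundary values can be checked directly with SciPy on toy two-dimensional vectors. This is a minimal sketch: real embedding vectors have far more dimensions, but the geometry is the same.

```python
from scipy.spatial import distance

def cosine_similarity(a, b):
    # SciPy provides cosine *distance*; similarity is 1 minus that.
    return 1 - distance.cosine(a, b)

print(cosine_similarity([1, 0], [1, 0]))   # same direction  -> 1.0
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal      -> 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite        -> -1.0
```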

This is a very brief and high-level description. It’s worth reading more into this area of machine learning, as it is the basis for how transformer-based AIs work and the basis for vector databases.

Code example

In the example below, semantic scoring is used to check something very simple: does the LLM (mocked here as a list of fixed words) behave the way it is expected to in response to a single word, “apple”?

from scipy.spatial import distance
from openai import OpenAI

# Placeholder for OpenAI API key
OPENAI_API_KEY = "YOUR-API-KEY-HERE"

client = OpenAI(api_key=OPENAI_API_KEY)

def embed_the_utterance(embedding_input):
    """Return the embedding vector for a string."""
    response = client.embeddings.create(
        input=embedding_input,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

def compare_the_scores(embedded_test_phrase, embedded_utterance):
    """Cosine similarity: 1 minus SciPy's cosine distance."""
    return 1 - distance.cosine(embedded_test_phrase, embedded_utterance)

utterance_one = "apple"
utterance_two = ["orange", "computer", "phone", "green", "metal", "code"]

utterance_one_embedded = embed_the_utterance(utterance_one)

for word in utterance_two:
    word_embedded = embed_the_utterance(word)
    print(f"Similarity between '{utterance_one}' and '{word}': "
          f"{compare_the_scores(utterance_one_embedded, word_embedded)}")

The output from the script should look something like this:

Similarity between 'apple' and 'orange': 0.47139255974653094
Similarity between 'apple' and 'computer': 0.4114103979091204
Similarity between 'apple' and 'phone': 0.46574072075653006
Similarity between 'apple' and 'green': 0.3249099659569473
Similarity between 'apple' and 'metal': 0.2428737864023377
Similarity between 'apple' and 'code': 0.2867175305510137

What does this mean

Each score represents how similar in meaning the two words are. “Orange” scores highest: it is also an edible fruit, though as a citrus fruit it is still quite different from an apple. At the other end, “metal” is much more dissimilar in meaning.

However, “phone” and “computer” also score as similar in meaning, due to the brand Apple. The embedding for “apple” therefore captures its multiple, distinct senses (fruit and brand) within a single vector.

Moving to a more realistic scenario, where we measure something an LLM might actually say, the inputs can be changed to this:

utterance_one = "Hello, how are you?"
utterance_two = ["Hello, how are you?", "Hi", "Give me a hug", "good bye", "I'm ignoring you", "I hate you"]

# The rest of the script (the 'for' loop) would be run with these new inputs

Which then gives this output:

Similarity between 'Hello, how are you?' and 'Hello, how are you?': 0.9999992636954766
Similarity between 'Hello, how are you?' and 'Hi': 0.5971280269091154
Similarity between 'Hello, how are you?' and 'Give me a hug': 0.291963905504095
Similarity between 'Hello, how are you?' and 'good bye': 0.4042506128722889
Similarity between 'Hello, how are you?' and 'I'm ignoring you': 0.3004746029260772
Similarity between 'Hello, how are you?' and 'I hate you': 0.23264604758339613

Again, greetings similar to “Hello, how are you?” score highly. “good bye”, while a different action from a hello, also scores relatively highly, because both phrases belong to the same semantic category of greetings and salutations.

How is this useful

This means that the meanings of LLM outputs can be quickly assessed and compared using embeddings and cosine similarity: hundreds of thousands of outputs can be scored against known good outcomes. Embedding results are deterministic as long as the embedding model stays the same, so thresholds can be set, and plots of semantic similarity can be compared between releases for regression purposes. An example of this might be tracking LLM behaviour changes between releases, or prompt changes that could affect tone of voice or context.
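As a sketch of what a threshold-based regression check could look like, the function name and the 0.8 cutoff below are illustrative assumptions, not from any library, and toy 3-dimensional vectors stand in for real 1536-dimensional embeddings:

```python
from scipy.spatial import distance

# Assumed threshold; in practice this would be tuned per use case.
SIMILARITY_THRESHOLD = 0.8

def passes_semantic_check(reference_embedding, response_embedding,
                          threshold=SIMILARITY_THRESHOLD):
    """Return (pass/fail, similarity) for a response against a known good reference."""
    similarity = 1 - distance.cosine(reference_embedding, response_embedding)
    return bool(similarity >= threshold), similarity

# Toy embeddings standing in for real ones.
reference = [0.9, 0.1, 0.2]
close_response = [0.85, 0.15, 0.25]   # semantically close  -> passes
drifted_response = [0.1, 0.9, 0.1]    # semantically drifted -> fails

print(passes_semantic_check(reference, close_response)[0])    # True
print(passes_semantic_check(reference, drifted_response)[0])  # False
```

In a real pipeline, the reference embedding would be computed once from a known good answer and stored, so each release only pays for embedding the new outputs.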

Key Points

Pros:

Cons: