Semantic Measurements

Scoring the meaning of natural language strings in a measurable way, so test outputs can be compared on what they mean rather than what they literally say.

Oct 2025 10 min

Key Idea: Scoring the meaning of natural language strings in a measurable way.

The test that fails for the wrong reason

You write a test. The expected response is “Hello, how are you?” The model returns “Hi!” The test fails. The output is fine. The assertion is wrong, because string equality is asking a question that does not fit the system being tested.

This is the gap semantic measurements fill. Instead of asking “is the output identical,” you ask “does the output mean the same thing.” That is a question that still makes sense when the system is non-deterministic.

What it is

Semantic measurements score meaning rather than word overlap, which is what metrics like ROUGE and BLEU measure. They look at what words mean individually and within the context of the sentence, paragraph, or corpus they sit in.

For testing AI, this means a numerical value can be assigned to the meaning of a response, and that value can be compared against a known good answer even when the wording is completely different.

How it works

Semantic measurements work by calculating the cosine of the angle between two vectors in an embedding space. In the example below, the vectors come from the OpenAI text-embedding-3-small model, which returns vectors of 1,536 dimensions by default. In plain terms, two pieces of text are converted into numerical vectors, and the angle between those vectors tells you how closely they point in the same direction. Closer direction means closer meaning.

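The scoring runs from -1 to 1: a score of 1 means the two vectors point in the same direction, 0 means they are at right angles (unrelated), and -1 means they point in opposite directions.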

Note: Most comparisons score between 0 and 1, because true semantic opposites are rarer than simple unrelated pairs.
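The calculation itself is small enough to write out by hand. The sketch below is a minimal illustration using only the Python standard library; the function name cosine_similarity and the two-dimensional toy vectors are assumptions made for the example, not part of the original script, and real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors chosen by hand to land on the three reference points of the scale.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # vectors at right angles
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # vectors pointing apart

print(same_direction, orthogonal, opposite)
```

The three results sit at (or within floating-point error of) 1, 0, and -1. Note that length does not matter: the first pair differs in magnitude but still scores as the same direction.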

The above is brief and high level. The underlying mechanism is the basis for transformer models and vector databases, and is worth reading into properly.

Code example

In the example below, semantic scoring is used to check something very simple: how close in meaning a handful of other words are to a single word, “apple.”

from openai import OpenAI
from scipy.spatial import distance

# Placeholder for OpenAI API Key
OPENAI_API_KEY = "YOUR-API-KEY-HERE"

# Create the client once and reuse it for every request
client = OpenAI(api_key=OPENAI_API_KEY)

def embed_the_utterance(embedding_input):
    # Request the embedding vector for a single string
    response = client.embeddings.create(
        input=embedding_input,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

def compare_the_scores(embedded_test_phrase, embedded_utterance):
    # scipy returns cosine distance, so subtract it from 1 to get similarity
    return 1 - distance.cosine(embedded_test_phrase, embedded_utterance)

utterance_one = "apple"
utterance_two = ["orange", "computer", "phone", "green", "metal", "code"]

utterance_one_embedded = embed_the_utterance(utterance_one)

# Score every comparison word against the reference word
for word in utterance_two:
    word_embedded = embed_the_utterance(word)
    score = compare_the_scores(utterance_one_embedded, word_embedded)
    print(f"Similarity between '{utterance_one}' and '{word}': {score}")

The output from the script should look something like this:

Similarity between 'apple' and 'orange': 0.47139255974653094
Similarity between 'apple' and 'computer': 0.4114103979091204
Similarity between 'apple' and 'phone': 0.46574072075653006
Similarity between 'apple' and 'green': 0.3249099659569473
Similarity between 'apple' and 'metal': 0.2428737864023377
Similarity between 'apple' and 'code': 0.2867175305510137
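The script makes one API request per word. The embeddings endpoint also accepts a list of strings, so the same vectors can be fetched in a single request. The sketch below is one way that might look; embed_many is a hypothetical helper name, and client is assumed to be an openai.OpenAI instance created as in the script.

```python
def embed_many(client, texts, model="text-embedding-3-small"):
    """Embed a list of strings with a single embeddings request.

    `client` is assumed to be an openai.OpenAI instance (or any object
    exposing the same embeddings.create interface).
    """
    response = client.embeddings.create(input=texts, model=model)
    # Each returned item carries the index of the input it belongs to,
    # so sort on that rather than relying on the order of the list.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

Called with the reference word and the comparison words together, this returns one vector per input in input order, which can then be scored exactly as before.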

What this means

Each score represents how close in meaning two words are. “Orange” scores highest because it is also an edible fruit, although it is a citrus fruit rather than a pome, so the score is not as high as a closer match would be. The scores then descend to “metal,” the lowest at about 0.24 and the most semantically distant word in the list.

The interesting result is “phone” and “computer,” which score close to “orange” (about 0.47 and 0.41 against 0.47). The likely reason is that the embedding for “apple” has captured both meanings of the word: the fruit and the brand. A single vector is holding two distinct semantic identities at once, which is a useful thing to see and a dangerous thing to forget when designing tests.

A more realistic scenario, closer to what an LLM might actually output, is the greeting case:

utterance_one = "Hello, how are you?"
utterance_two = ["Hello, how are you?", "Hi", "Give me a hug", "good bye", "I'm ignoring you", "I hate you"]

# The rest of the script (the 'for' loop) would be run with these new inputs

Which gives this output:

Similarity between 'Hello, how are you?' and 'Hello, how are you?': 0.9999992636954766
Similarity between 'Hello, how are you?' and 'Hi': 0.5971280269091154
Similarity between 'Hello, how are you?' and 'Give me a hug': 0.291963905504095
Similarity between 'Hello, how are you?' and 'good bye': 0.4042506128722889
Similarity between 'Hello, how are you?' and 'I'm ignoring you': 0.3004746029260772
Similarity between 'Hello, how are you?' and 'I hate you': 0.23264604758339613

Greetings score highest against the original: the identical string scores almost exactly 1, and “Hi” follows at about 0.60. “Good bye” comes next at about 0.40, because it sits in the same semantic category of salutations even though the action is the opposite. This is the kind of result that catches people out the first time. Semantic similarity measures how close two meanings are, not whether they agree.
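To turn these scores into a test result, a threshold is needed. The sketch below reuses the greeting scores from the output above, rounded to three decimal places; the cut-off of 0.5 and the helper name passes are assumptions chosen purely for illustration, not recommended values.

```python
# Scores copied from the greeting output above, rounded to three decimal places.
scores = {
    "Hello, how are you?": 1.000,
    "Hi": 0.597,
    "Give me a hug": 0.292,
    "good bye": 0.404,
    "I'm ignoring you": 0.300,
    "I hate you": 0.233,
}

# Hypothetical cut-off: anything at or above this counts as "same meaning".
THRESHOLD = 0.5

def passes(score, threshold=THRESHOLD):
    """Return True when a similarity score meets the threshold."""
    return score >= threshold

results = {text: passes(score) for text, score in scores.items()}

for text, ok in results.items():
    print(f"{'PASS' if ok else 'FAIL'}  {scores[text]:.3f}  {text}")
```

With this cut-off only the identical string and “Hi” pass. Lowering it to 0.4 would also admit “good bye,” so a threshold is best calibrated against examples that have been labelled by hand.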

How this is useful

The meaning of LLM outputs can be assessed and compared at scale using embeddings and cosine similarity. Hundreds of thousands of outputs can be compared against known good answers in a single run. The embedding results are close to deterministic as long as the embedding model stays the same (the identical greeting above scored 0.9999993 rather than exactly 1, which suggests tiny differences between calls), so thresholds can be set and trends can be tracked across releases.
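As a rough sketch of that comparison step, the code below scores each output vector against every golden vector and keeps the closest match. The function names and the three-dimensional toy vectors are assumptions standing in for real embeddings, and at real volumes a vectorised library would likely replace the pure-Python loops.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_matches(output_vectors, golden_vectors):
    """For each output vector, return (index of closest golden vector, score)."""
    results = []
    for vec in output_vectors:
        sims = [cosine_similarity(vec, golden) for golden in golden_vectors]
        best_index = max(range(len(sims)), key=lambda i: sims[i])
        results.append((best_index, sims[best_index]))
    return results

# Toy vectors standing in for real embeddings of golden answers and outputs.
golden = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
outputs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 1.0]]

matches = best_matches(outputs, golden)
for i, (golden_index, score) in enumerate(matches):
    print(f"output {i}: closest golden answer {golden_index}, score {score:.3f}")
```

The first two outputs land close to a golden answer, while the third has a low best score, which is the kind of case a threshold would flag for review.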

A practical example: track semantic similarity between live outputs and a set of golden responses, plotted over time. A drift in the trend signals that prompt changes, model updates, or context modifications have shifted the system’s behaviour. The drift can show up well before any individual output looks obviously wrong.
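One simple way to flag such drift is to compare the mean similarity of a recent batch against a baseline batch. The sketch below assumes the scores have already been computed; detect_drift, the tolerance of 0.05, and the sample scores are all invented for illustration.

```python
from statistics import mean

def detect_drift(baseline_scores, recent_scores, tolerance=0.05):
    """Return (drifted, drop), where drop is the fall in mean similarity.

    drifted is True when the recent mean sits more than `tolerance`
    below the baseline mean.
    """
    drop = mean(baseline_scores) - mean(recent_scores)
    return drop > tolerance, drop

# Invented similarity scores against golden responses, before and after a change.
baseline = [0.82, 0.79, 0.85, 0.81]
after_change = [0.74, 0.70, 0.77, 0.72]

drifted, drop = detect_drift(baseline, after_change)
print(f"drifted={drifted}, mean similarity fell by {drop:.3f}")
```

A mean hides the shape of the distribution, so in practice it may be worth tracking a low percentile alongside it.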

Key points

Pros

Cons

Final Thoughts

Back to the original test: it can now score whether the meaning of the output is similar to a gold standard answer. It can even map how close in meaning a response is to many different answer types. This score can then be tracked as a trend over time, or a threshold can be applied to give a pass or fail state, making the testing much more robust as prompts and models change over time.