Semantic Measurements, How To Test AI

Key Idea: Scoring the meaning of natural language strings in a measurable way.

The test that fails for the wrong reason

You write a test. The expected response is “Hello, how are you?” The model returns “Hi!” The test fails. The output is fine. The assertion is wrong, because string equality is asking a question that does not fit the system being tested.

This is the gap semantic measurements fill. Instead of asking “is the output identical,” you ask “does the output mean the same thing.” That is a question you can sensibly ask of a non-deterministic system.

What it is

Semantic measurements score meaning rather than word overlap, which is what metrics like ROUGE and BLEU measure. They look at what words mean individually and within the context of the sentence, paragraph, or corpus they sit in.

For testing AI, this means a numerical value can be assigned to the meaning of a response, and that value can be compared against a known good answer even when the wording is completely different.

How it works

The semantic measurements used here work by calculating the cosine of the angle between two vectors in an embedding space. In the example below, the vectors come from the OpenAI text-embedding-3-small model, which returns vectors of 1,536 dimensions by default. In plain terms, two pieces of text are converted into numerical vectors, and the angle between those vectors tells you how closely they point in the same direction. Closer direction means closer meaning.

The score runs from -1 to 1:

1: The vectors point in exactly the same direction. Their length does not matter, only the angle.
0: The vectors are orthogonal, with no semantic relationship. Think “apple” and “quantum physics.”
-1: The vectors point in opposite directions. Think “good” and “bad.”

Note: In practice most comparisons score between 0 and 1, because true semantic opposites are rarer than simple unrelated pairs. The examples given for 0 and -1 are idealised: in the outputs below, even the least related pairs score above 0.2.
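
The calculation itself is small enough to write out by hand. The sketch below is a plain-Python version of cosine similarity, run on made-up two-dimensional vectors rather than real embeddings, to show the three reference points on the scale. The function name and the vectors are illustrative assumptions, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-D vectors, invented for illustration (not real embeddings).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # same angle, different length
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # at right angles
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # pointing the other way

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
# prints: 1.0 0.0 -1.0
```

The first pair differs in length but not in direction, which is why it still scores 1. Only the angle counts.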

The above is brief and high level. The underlying mechanism is the basis for transformer models and vector databases, and is worth reading into properly.

Code example

In the example below, semantic scoring is used to check something very simple: does the embedding model behave as expected when a single word, “apple,” is compared against a handful of other words?

from scipy.spatial import distance
from openai import OpenAI

# Placeholder for OpenAI API Key
OPENAI_API_KEY = "YOUR-API-KEY-HERE"

# Create the client once and reuse it for every request
client = OpenAI(api_key=OPENAI_API_KEY)

def embed_the_utterance(embedding_input):
    # Request an embedding and return the vector for the single input
    response = client.embeddings.create(
        input=embedding_input,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

def compare_the_scores(embedded_test_phrase, embedded_utterance):
    # SciPy returns cosine distance, so subtract it from 1 to get cosine similarity
    return 1 - distance.cosine(embedded_test_phrase, embedded_utterance)

utterance_one = "apple"
utterance_two = ["orange", "computer", "phone", "green", "metal", "code"]

# Embed the reference word once, then compare every candidate against it
utterance_one_embedded = embed_the_utterance(utterance_one)

for word in utterance_two:
    word_embedded = embed_the_utterance(word)
    score = compare_the_scores(utterance_one_embedded, word_embedded)
    print(f"Similarity between '{utterance_one}' and '{word}': {score}")

The output from the script should look something like this:

Similarity between 'apple' and 'orange': 0.47139255974653094
Similarity between 'apple' and 'computer': 0.4114103979091204
Similarity between 'apple' and 'phone': 0.46574072075653006
Similarity between 'apple' and 'green': 0.3249099659569473
Similarity between 'apple' and 'metal': 0.2428737864023377
Similarity between 'apple' and 'code': 0.2867175305510137

What this means

Each score represents how close in meaning two words are. “Orange” scores highest because it is also an edible fruit, although it is a citrus fruit rather than a pome, so the score is lower than a closer match would give. The scores then fall away to “metal,” the lowest at about 0.24, which is much more semantically distant.

The interesting result is “phone” and “computer,” which both score similarly to “orange.” That is because the embedding for “apple” has captured both meanings of the word: the fruit and the brand. A single vector is holding two distinct semantic identities at once, which is a useful thing to see and a dangerous thing to forget when designing tests.

A more realistic scenario, closer to what an LLM might actually output, is the greeting case:

utterance_one = "Hello, how are you?"
utterance_two = ["Hello, how are you?", "Hi", "Give me a hug", "good bye", "I'm ignoring you", "I hate you"]

# The rest of the script (the 'for' loop) would be run with these new inputs

Which gives this output:

Similarity between 'Hello, how are you?' and 'Hello, how are you?': 0.9999992636954766
Similarity between 'Hello, how are you?' and 'Hi': 0.5971280269091154
Similarity between 'Hello, how are you?' and 'Give me a hug': 0.291963905504095
Similarity between 'Hello, how are you?' and 'good bye': 0.4042506128722889
Similarity between 'Hello, how are you?' and 'I'm ignoring you': 0.3004746029260772
Similarity between 'Hello, how are you?' and 'I hate you': 0.23264604758339613

The greeting “Hi” scores highest after the exact match. “Good bye” also scores reasonably high, because it sits in the same semantic category of salutations even though the action is the opposite. This is the kind of result that catches people out the first time. Semantic similarity measures how close two meanings are, not whether they agree.

How this is useful

The meaning of LLM outputs can be assessed and compared at scale using embeddings and cosine similarity. Hundreds of thousands of outputs can be compared against known good answers in a single run. The scores are repeatable as long as the embedding model stays the same, give or take small numerical noise (the identical strings above score 0.9999993 rather than exactly 1), which means thresholds can be set and trends can be tracked across releases.
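
As a rough sketch of what a threshold looks like in practice, the snippet below applies a pass/fail cut-off to the scores from the greeting output above, rounded to three decimal places. The threshold of 0.5 and the names `SIMILARITY_THRESHOLD` and `passes` are assumptions made for illustration; a real suite would tune the value against its own golden answers.

```python
# Scores copied from the greeting output above, rounded to three decimal places.
GREETING_SCORES = {
    "Hello, how are you?": 1.000,
    "Hi": 0.597,
    "Give me a hug": 0.292,
    "good bye": 0.404,
    "I'm ignoring you": 0.300,
    "I hate you": 0.233,
}

# Hypothetical threshold, chosen only for this illustration.
SIMILARITY_THRESHOLD = 0.5

def passes(score, threshold=SIMILARITY_THRESHOLD):
    """Return True when a similarity score meets or exceeds the threshold."""
    return score >= threshold

# Apply the same cut-off to every candidate response.
results = {text: passes(score) for text, score in GREETING_SCORES.items()}

for text, ok in results.items():
    print(f"{'PASS' if ok else 'FAIL'}  {text}")
```

With this cut-off only the exact match and “Hi” pass. “Good bye” at 0.404 fails, but not by a wide margin, which shows how much the choice of threshold matters.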

A practical example: track semantic similarity between live outputs and a set of golden responses, plotted over time. A drift in the trend signals that prompt changes, model updates, or context modifications have shifted the system’s behaviour. The drift can show up well before any individual output looks obviously wrong.
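
One way drift detection along these lines might be sketched is with a rolling mean compared against a baseline. Everything here is an assumption for illustration: the daily scores are invented, and the window size, tolerance, and function names are hypothetical choices rather than recommended values.

```python
from statistics import mean

def rolling_mean(values, window):
    """Mean of each consecutive run of `window` values."""
    return [
        mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]

def first_drift_index(scores, baseline, tolerance, window=3):
    """Index of the last score in the first window whose mean drops
    below baseline - tolerance, or None if that never happens."""
    for offset, average in enumerate(rolling_mean(scores, window)):
        if average < baseline - tolerance:
            return offset + window - 1
    return None

# Invented daily mean similarity against the golden responses.
daily_scores = [0.82, 0.81, 0.83, 0.80, 0.78, 0.74, 0.71, 0.69]

# Treat the first three days as the known good baseline.
baseline = mean(daily_scores[:3])

drift_at = first_drift_index(daily_scores, baseline, tolerance=0.05)
print(f"Baseline: {baseline:.2f}, drift flagged at index: {drift_at}")
```

The rolling window smooths out single noisy runs, so the flag is raised by a sustained decline rather than one bad day.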

Key points

Semantics are not sentiments. Semantic embeddings map meaning regardless of whether the meaning is positive, negative, or neutral. Sentiment is a separate concern.
Semantics measure meaning in the context of surrounding words, not just the word itself.
Semantic scores are repeatable as long as the embedding model used stays the same.
Embeddings are the basis for transformers, vector databases, and many other modern AI components.

Pros

Measures trends in meaning. Outputs can be tracked against expected behaviour over time.
Robust against non-deterministic outputs because the score depends on meaning, not specific wording.
Provides a normalised score between -1 and 1 that supports thresholds, plots, and pass/fail decisions.

Cons

A “gold standard” set of expected responses is needed for meaningful comparison. If no known good answers are available, getting started is difficult.
If the known good answers are poor, the comparison scoring is compromised in turn.
Known answers can become outdated quickly as the system evolves.
A high similarity score does not guarantee correctness, only relatedness. The “apple” example shows the trap directly. A response about a phone could score unexpectedly high against a golden answer about fruit if the word “apple” is involved, which complicates category-specific tests.

Final Thoughts

Back to the original test: it can now score whether the meaning of the output is similar to a gold standard answer. The same approach can even be used to map how close in meaning a response is to many different answer types. The score can then be tracked as a trend over time, or a threshold can be applied to give a pass or fail state, making the testing much more robust as the prompting changes over time.
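
Mapping a response to several answer types could look something like the sketch below. It uses invented three-dimensional vectors in place of real embeddings so that it runs without an API call. The labels, vectors, and the `closest_answer_type` helper are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_answer_type(response_vector, reference_vectors):
    """Score a response against every reference and return the best label
    together with all of the scores."""
    scores = {
        label: cosine_similarity(response_vector, vector)
        for label, vector in reference_vectors.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Toy vectors standing in for embeddings of one golden answer per type.
REFERENCE_VECTORS = {
    "greeting": [0.9, 0.1, 0.0],
    "farewell": [0.6, 0.7, 0.1],
    "refusal": [0.0, 0.2, 0.9],
}

# Toy vector standing in for the embedding of a model response.
response_vector = [0.8, 0.2, 0.1]

label, scores = closest_answer_type(response_vector, REFERENCE_VECTORS)
print(f"Closest answer type: {label}")
for name, score in scores.items():
    print(f"  {name}: {score:.3f}")
```

In a real test the toy vectors would be replaced by embeddings of the golden answers and the model output, and the full set of scores could be logged alongside the winning label.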
