Key Idea: Scoring the meaning of natural-language strings numerically.
What is it
Semantic measurement scores meaning rather than surface overlap: instead of counting which words are the same, as ROUGE and BLEU do, it considers what those words mean individually and within the context of the sentence, paragraph, or corpus that contains them.
This is useful when testing AI because it gives a numerical value for the meaning of a response, allowing non-deterministic outputs to be compared against a known good answer.
How does it work
Semantic measurement works by calculating the cosine of the angle between two vectors in an embedding space. In the example below, the vectors come from the OpenAI text-embedding-3-small model, which returns vectors of 1536 dimensions. In plain terms: the two inputs are each embedded into a vector, and the angle between the two vectors is measured to show whether they point in a similar direction. A similar direction means closer in meaning.
The scoring is between -1 and 1:
- 1: The vectors are identical (point in the same direction).
- 0: The vectors are orthogonal (have no semantic relationship, like “apple” and “quantum physics”).
- -1: The vectors are opposites (point in opposite directions, like “good” and “bad”).
Note: Most comparisons will score between 0 and 1, as opposite vectors are less common than unrelated ones.
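To make the angle measurement concrete, cosine similarity can be computed by hand on tiny toy vectors. The 2-dimensional vectors below are purely illustrative; real embedding vectors from the model above have 1536 dimensions.

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    # Magnitudes (Euclidean norms) of each vector
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction: close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal: 0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction: close to -1
```

The three calls correspond to the three anchor points of the scoring range above.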
This is a very brief and high-level description. It’s worth reading more into this area of machine learning, as it is the basis for how transformer-based AIs work and the basis for vector databases.
Code example
In the example below, semantic scoring is used to check something very simple: does the LLM (mocked in this case) behave the way it's expected to in response to a single word, "apple"?
from scipy.spatial import distance
from openai import OpenAI

# Placeholder for OpenAI API key
OPENAI_API_KEY = "YOUR-API-KEY-HERE"

def embed_the_utterance(embedding_input):
    # Embed the input text with OpenAI's text-embedding-3-small model
    client = OpenAI(api_key=OPENAI_API_KEY)
    response = client.embeddings.create(
        input=embedding_input,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

def compare_the_scores(embedded_test_phrase, embedded_utterance):
    # scipy's distance.cosine returns cosine *distance*, so similarity is 1 - distance
    return 1 - distance.cosine(embedded_test_phrase, embedded_utterance)

utterance_one = "apple"
utterance_two = ["orange", "computer", "phone", "green", "metal", "code"]

utterance_one_embedded = embed_the_utterance(utterance_one)
for word in utterance_two:
    word_embedded = embed_the_utterance(word)
    print(f"Similarity between '{utterance_one}' and '{word}': {compare_the_scores(utterance_one_embedded, word_embedded)}")
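Running the script above requires a real API key. For unit tests, the embedding call can be stubbed with fixed vectors so the comparison logic runs deterministically. The 3-dimensional vectors below are invented for illustration (real embeddings have 1536 dimensions), and the stub function is a hypothetical stand-in for the API call; a minimal sketch:

```python
import math

# Hypothetical stand-in embeddings; real ones come from the embedding API.
FAKE_EMBEDDINGS = {
    "apple":  [0.9, 0.4, 0.1],
    "orange": [0.8, 0.5, 0.2],  # deliberately close to "apple"
    "metal":  [0.1, 0.2, 0.9],  # deliberately far from "apple"
}

def embed_the_utterance_stub(text):
    # Replaces the API call during tests
    return FAKE_EMBEDDINGS[text]

def compare_the_scores(a, b):
    # Cosine similarity computed with the standard library instead of scipy
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

apple = embed_the_utterance_stub("apple")
for word in ("orange", "metal"):
    score = compare_the_scores(apple, embed_the_utterance_stub(word))
    print(f"Similarity between 'apple' and '{word}': {score:.3f}")
```

With the stubbed vectors, "orange" scores higher than "metal", mirroring the shape of the real output without any network calls.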
The output from the script should look something like this:
Similarity between 'apple' and 'orange': 0.47139255974653094
Similarity between 'apple' and 'computer': 0.4114103979091204
Similarity between 'apple' and 'phone': 0.46574072075653006
Similarity between 'apple' and 'green': 0.3249099659569473
Similarity between 'apple' and 'metal': 0.2428737864023377
Similarity between 'apple' and 'code': 0.2867175305510137
What does this mean
Each individual score represents how similar in meaning the two words are. "Orange" scores highest because, like an apple, it is an edible fruit, even though it is a citrus fruit and therefore still quite different from an apple. At the other end of the range, "metal" is much more dissimilar in meaning.
However, it can be seen that “phone” and “computer” are also seen as being similar in meaning due to the brand Apple. Therefore, the embedding for ‘apple’ successfully captures its multiple, distinct semantic meanings (fruit and brand) within a single vector.
Using a more realistic scenario, where we measure something an LLM might actually say, the inputs can be changed to this:
utterance_one = "Hello, how are you?"
utterance_two = ["Hello, how are you?", "Hi", "Give me a hug", "good bye", "I'm ignoring you", "I hate you"]
# The rest of the script (the 'for' loop) would be run with these new inputs
Which then gives this output:
Similarity between 'Hello, how are you?' and 'Hello, how are you?': 0.9999992636954766
Similarity between 'Hello, how are you?' and 'Hi': 0.5971280269091154
Similarity between 'Hello, how are you?' and 'Give me a hug': 0.291963905504095
Similarity between 'Hello, how are you?' and 'good bye': 0.4042506128722889
Similarity between 'Hello, how are you?' and 'I'm ignoring you': 0.3004746029260772
Similarity between 'Hello, how are you?' and 'I hate you': 0.23264604758339613
Again, it can be seen that greetings similar to "Hello, how are you?" are scored highly. It can also be seen that "good bye", while a different action from "hello", still scores relatively highly. This is because both phrases belong to the same semantic category of greetings and salutations.
How is this useful
This means that the meanings of LLM outputs can quickly be assessed and compared using embeddings and cosine similarity. Hundreds of thousands of LLM outputs can be compared against known good outcomes. The embedding results are deterministic as long as the embedding model remains consistent; therefore, thresholds can be set, and plots of semantic similarity can be compared between releases for regression purposes. An example might be tracking how LLM behaviour changes between releases, or how changes to prompting affect tone of voice or context.
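A regression gate built on such thresholds could look like the sketch below. The threshold value, test-case names, and scores are all invented for illustration; in practice the scores would come from embedding real LLM outputs and golden answers as shown earlier.

```python
# Hypothetical pass mark; tune per use case against historical score plots
THRESHOLD = 0.8

def passes_semantic_check(score, threshold=THRESHOLD):
    # A release passes a test case if its output's similarity to the
    # golden answer meets the threshold
    return score >= threshold

# Invented similarity scores for two test cases in a release
release_scores = {"greeting": 0.93, "refund_policy": 0.71}

failures = [name for name, score in release_scores.items()
            if not passes_semantic_check(score)]
print(failures)  # -> ['refund_policy']
```

Tracking which cases fall below the threshold release-over-release gives the trend data described above.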
Key Points
- Semantics are not sentiments; semantic embeddings map the meaning regardless of whether it is good, bad, or agnostic. Sentiment is concerned with positive or negative views.
- Semantics measure meaning within the context of the word and the surrounding words.
- Semantics are deterministic as long as the embedding model used is consistent.
- Semantics are the basis for transformers, vector databases, and many other applications.
Pros:
- Measures trends of meaning: are the outputs the LLM generates close in meaning to how the system is expected to behave?
- Robust because they do not rely on specific words, but on the meaning of the words, which is good for non-deterministic outputs.
- Provides a normalised score between -1 and 1 that can be used to plot scores or work with test pass/fail thresholds.
Cons:
- Ideally, “Gold standard” responses are required to get a good scoring comparison. If no known good values are available, it’s difficult to get started.
- If the known good answers are poor, the comparison scoring will be compromised.
- The known answers can quickly become outdated.
- A high similarity score does not guarantee correctness, only relatedness. This can be complex: for example, the word “apple” is semantically related to both “fruit” and “phone.” This means a response about a “phone” could score unexpectedly high against a “golden” answer about “fruit” if the word “apple” is involved, complicating category-specific tests.