Language Heuristics Part 1, How To Test AI

Key Idea: Analysing text outputs for verbosity, language complexity and token performance.

The output that sounded fine

A team reviews a series of LLM outputs before release. They look good and read cleanly. The change ships. A week later the tracking starts to show a pattern. User engagement has dropped, and no one understands what has happened, other than that the drop began when the last release went out. The reviewers are engineers, product managers, and writers, all of whom read at a different level of comprehension from the people the system was actually built for. The output was never wrong. It was just pitched at the wrong audience, and nobody on the team had the measurements to catch it.

Language heuristics give you those measurements. They turn “this reads fine to me” into a number that can be checked against a target.

What it is

This is one of the simplest methods for checking LLM outputs. It is also one of the most effective. It measures how the LLM is interacting with prompt language and context data, analyses word usage, and gives a measurable signal on whether the output is accessible to the intended user.

How it works

Using Python's standard library, the third-party textstat package, and basic natural language processing techniques, both the inputs to and the outputs from the LLM can be assessed for verbosity, readability, and complexity. The metrics are simple, the libraries are mature, and the cost of running them is effectively zero compared to the model calls they are scoring.

Code example

The example below uses three LLM outputs covering the same subject. The first is a default response with no instruction on language or audience. The second was generated with an instruction to use accessible language aimed at younger audiences. The third was instructed to use language aimed at scientific researchers.

import string

import textstat  # third-party package: pip install textstat

llm_output_default = '''
The intricate systems that govern our planet are a testament to the complex interplay of biology, chemistry, and physics. Consider the Amazon rainforest, often called the "lungs of the Earth." This vast ecosystem is a crucible of biodiversity, home to millions of species of insects, plants, fish, and mammals, many yet to be discovered. The process of photosynthesis occurs on an unimaginable scale, absorbing immense quantities of carbon dioxide from the atmosphere and releasing the oxygen we depend on. This delicate biological balance is under constant threat from deforestation, which in turn exacerbes global climate change. Understanding this ecosystem requires not just biology, but also advanced remote sensing technology.
'''

llm_more_accessible_output = '''
Our world is full of amazing connections, like a giant puzzle. Think about the Amazon rainforest! People call it the "lungs of the Earth." It's a huge jungle filled with millions of different bugs, plants, fish, and animals. Lots of them haven't even been discovered yet! The plants and trees do something cool called photosynthesis. They breathe in a gas called carbon dioxide (which we breathe out) and breathe out the oxygen we need to live. But this wonderful place is in danger because people are cutting down the trees. This is bad for the forest and also makes the whole planet get warmer. To watch over the forest, scientists use special cameras from space.
'''

llm_less_accessible_output = '''
The convoluted biogeochemical frameworks that modulate our planet serve as a testament to the multifaceted synergistic interactions of biology, chemistry, and physics. Consider the Amazonian rainforest, colloquially designated the "primary terrestrial biogeochemical engine." This expansive biome functions as a nexus of macro-evolutionary diversification, hosting innumerable taxa of insects, flora, fish, and mammals, a significant portion remaining uncatalogued. The process of photosynthetic carbon fixation proceeds at a prodigious magnitude, sequestering substantial volumes of atmospheric carbon dioxide whilst liberating the diatomic oxygen upon which complex life depends. This precarious homeostatic equilibrium exists under perpetual jeopardy from anthropogenic silvicultural clearing, which in turn amplifies global climatological perturbations. Comprehending this biome necessitates not merely biological sciences, but also sophisticated geospatial surveillance methodologies.
'''

def get_cleaned_words(text):
    """Helper function to get a list of words, lowercase and without punctuation."""
    text = text.replace('-', ' ')
    cleaned_text = text.lower().translate(str.maketrans('', '', string.punctuation))
    words = cleaned_text.split()
    return words

def count_words(text):
    """Counts the total number of words in the text."""
    words = get_cleaned_words(text)
    return len(words)

def count_sentences(text):
    """Counts the number of sentences using textstat for robustness."""
    return textstat.sentence_count(text)

def count_characters(text):
    """Counts the total number of characters, including spaces and punctuation."""
    return len(text)

def count_unique_words(text):
    """Counts the number of unique (distinct) words in the text."""
    words = get_cleaned_words(text)
    unique_words = set(words)
    return len(unique_words)

def average_word_length(text):
    """Calculates the average length of words in the text."""
    words = get_cleaned_words(text)
    total_length = sum(len(word) for word in words)
    return total_length / len(words) if words else 0

def calculate_readability_metrics(text):
    """Calculates Flesch-Kincaid Grade Level and Flesch Reading Ease."""
    grade_level = textstat.flesch_kincaid_grade(text)
    reading_ease = textstat.flesch_reading_ease(text)
    return grade_level, reading_ease

# --- Analysis Section ---
# Each entry pairs the heading printed in the report with the text to analyse.
outputs = [
    ("LLM Output Default:", llm_output_default),
    ("LLM More Accessible Output Analysis:", llm_more_accessible_output),
    ("LLM Less Accessible Output Analysis:", llm_less_accessible_output),
]

for index, (heading, text) in enumerate(outputs):
    grade_level, reading_ease = calculate_readability_metrics(text)
    # Blank line between reports, but not before the first one.
    print(heading if index == 0 else "\n" + heading)
    print("Word Count:", count_words(text))
    print("Sentence Count:", count_sentences(text))
    print("Character Count:", count_characters(text))
    print("Unique Word Count:", count_unique_words(text))
    # Round to two decimals so the report stays easy to scan.
    print("Average Word Length:", round(average_word_length(text), 2))
    print("Flesch-Kincaid Grade Level:", round(grade_level, 2))
    print("Flesch Reading Ease:", round(reading_ease, 2))

The output from the script looks something like this:

LLM Output Default:
Word Count: 109
Sentence Count: 6
Character Count: 734
Unique Word Count: 84
Average Word Length: 5.55
Flesch-Kincaid Grade Level: 14.34
Flesch Reading Ease: 24.63

LLM More Accessible Output Analysis:
Word Count: 115
Sentence Count: 10
Character Count: 667
Unique Word Count: 85
Average Word Length: 4.61
Flesch-Kincaid Grade Level: 5.83
Flesch Reading Ease: 73.78

LLM Less Accessible Output Analysis:
Word Count: 119
Sentence Count: 6
Character Count: 991
Unique Word Count: 99
Average Word Length: 7.16
Flesch-Kincaid Grade Level: 22.38
Flesch Reading Ease: -30.36

What this means

The three outputs score very differently against the same metrics. The default response sits at a Flesch-Kincaid grade of about 14, roughly the level of a university-educated reader. The accessible version drops to about grade 6, which a primary-school child could read. The scientific version goes the other way to about grade 22, which is well past the school and university grades the scale was built around and effectively unreadable for most people.
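For readers who want to see where these scores come from, the sketch below writes out the standard published Flesch formulas as plain Python. The function names and the sample counts are illustrative assumptions, not part of the script above, and textstat does its own word, sentence, and syllable counting, so this will not necessarily reproduce its figures exactly.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher is easier; long sentences and long words lower it."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Illustrative counts only; these are NOT measured from the three outputs above.
plain = {"words": 100, "sentences": 10, "syllables": 130}
dense = {"words": 100, "sentences": 5, "syllables": 230}

for label, counts in (("plain", plain), ("dense", dense)):
    ease = flesch_reading_ease(**counts)
    grade = flesch_kincaid_grade(**counts)
    print(f"{label}: reading ease {ease:.1f}, grade level {grade:.1f}")
```

Both formulas depend only on words per sentence and syllables per word. Neither is clamped, which is why a dense enough text can push the reading ease below zero and the grade level past 20.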

Average word length tells the same story in fewer numbers: 4.61 characters for the accessible version, 5.55 for the default, 7.16 for the scientific. Sentence count tells you about pacing: 10 short sentences in the accessible version against 6 long ones in the others.
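The pacing point can be made explicit by dividing the counts already reported in the script output. The snippet below is a small sketch that derives words per sentence and the share of unique words from those figures. The helper name derived_ratios is an assumption introduced here for illustration.

```python
# Counts copied from the script output above: (words, sentences, unique words).
reported = {
    "default": (109, 6, 84),
    "more accessible": (115, 10, 85),
    "less accessible": (119, 6, 99),
}


def derived_ratios(words: int, sentences: int, unique_words: int) -> dict:
    """Turn raw counts into two ratios that are comparable across text lengths."""
    return {
        "words_per_sentence": round(words / sentences, 2),
        "unique_word_ratio": round(unique_words / words, 2),
    }


ratios = {label: derived_ratios(*counts) for label, counts in reported.items()}
for label, values in ratios.items():
    print(label, values)
```

On these counts the accessible version averages 11.5 words per sentence, against roughly 18 and 20 for the other two.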

The point is not that any one of these is the right output. The point is that the prompt instructions, even when they were not deliberate, had a measurable and significant effect on accessibility. Without these metrics, that effect lives in the gut of whoever happens to review the output.

How this is useful

The same metrics serve two jobs. On the prompt side, they catch unintentional complexity. A clear, concise prompt tends to use fewer tokens and leaves the model less ambiguity to resolve. On the output side, they confirm that responses are pitched at the intended audience. A children’s education product needs Flesch-Kincaid scores in single digits. A research tool can comfortably sit in the teens. A consumer support assistant probably wants to land somewhere between the two.

Set a target range, run the metrics on every output, plot the trend over time. Drift outside the band is a signal worth investigating, the same way a drifting evaluation score is.
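A minimal sketch of that check is shown below, using only the standard library. It assumes the grade-level scores have already been calculated for each output. The band limits, the window size, the sample scores, and the function names are all hypothetical values chosen for illustration.

```python
from collections import deque
from statistics import mean


def outside_band(score: float, low: float, high: float) -> bool:
    """True when a score falls outside the inclusive target range."""
    return not (low <= score <= high)


def rolling_means(scores, window: int):
    """Yield the mean of the most recent `window` scores after each new score."""
    recent = deque(maxlen=window)
    for score in scores:
        recent.append(score)
        yield mean(recent)


# Hypothetical Flesch-Kincaid grades for six consecutive outputs.
grade_scores = [7.1, 6.8, 7.4, 8.9, 9.6, 10.2]
TARGET_LOW, TARGET_HIGH = 6.0, 9.0  # assumed band for a consumer-facing product

flagged = [i for i, s in enumerate(grade_scores)
           if outside_band(s, TARGET_LOW, TARGET_HIGH)]
trend = [round(m, 2) for m in rolling_means(grade_scores, window=3)]

print("Outputs outside the band (by index):", flagged)
print("Rolling mean grade:", trend)
```

The per-output check catches single outliers, while the rolling mean makes a gradual climb visible before every output is out of range.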

Key points

A very basic evaluation method that can be implemented with a low skill level.
Strong for testing accessibility and language selection.

Pros

Simple and easy to implement.
Effective in scenarios where word count matters, such as token performance.
Useful for heuristic rules and for narrowing down data sets before human evaluation.
Measures prompt complexity and language usage at the same cost as measuring output.

Cons

Only checks the language characteristics. It does not check whether the language is meaningful, applicable to the request, or factually correct.

Final Thoughts

Back to the example at the start: the measurements above could have been applied to the system under test, giving the engineering team objective numbers to review before release. On top of that, the same evaluations could be run on real user inputs to establish a baseline to compare the pre-release outputs against.
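As a rough sketch of that baseline comparison, the snippet below compares the average grade level of user inputs with that of candidate outputs. The scores are hypothetical and the function name grade_gap is an assumption. In practice the scores would come from running the readability metrics over real user messages and pre-release responses.

```python
from statistics import mean


def grade_gap(user_input_grades, output_grades) -> float:
    """Positive gap: outputs read at a higher grade level than users write at."""
    return mean(output_grades) - mean(user_input_grades)


# Hypothetical Flesch-Kincaid grades, for illustration only.
user_input_grades = [5.9, 6.4, 7.2, 6.1]
candidate_output_grades = [13.8, 14.3, 12.9, 14.6]

gap = grade_gap(user_input_grades, candidate_output_grades)
print(f"Outputs sit {gap:.1f} grade levels above the user baseline")
```

A large positive gap is the kind of signal that would have flagged the release in the opening example before it shipped.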
