Observability, How To Test AI

The bug

This scenario is all too common, even in traditional software testing. A customer reports an issue, and the stakeholders want to understand what it is, how it could have happened, and what we can do to fix it. As a tester you are usually on the frontline of trying to replicate the issue, or at least of narrowing down what happened and why. However, the logging is not helping. There are too many logs, or they lack the information needed to narrow the scope of the testing. Add to this that we are now testing an AI-based system, where user inputs are unpredictable and model outputs are non-deterministic, so the behaviour is not repeatable, and the pressure is on. What do we do?

This is the everyday reality of working with Large Language Models. The system was already a black box, and on top of that the failures are non-deterministic, infrequent, and rarely reproducible from the report alone. Without a record of what the model actually saw and produced at the moment things went wrong, you are debugging from memory and vibes.

Observability is what closes that gap.

What observability actually is

Observability is the practice of collecting input, output, and system data so that root cause analysis and behavioural debugging are possible at all. In traditional software this is usually done through logging. In LLM systems, raw logs quickly grow beyond usability and bury the signal you actually need, so the work shifts towards capturing model responses and their surrounding metadata as structured data in storage that is built for it.

The goal is simple. Make the system inspectable at the interface level, even if the model itself stays opaque.

Why it matters

When an LLM behaves unexpectedly, the source of the issue is rarely obvious. Was it the prompt? The retrieved context? The surrounding system logic, the tools, the guardrail layer? Or was it just the model being unpredictable?

Without structured data behind every response, you cannot answer that question. You guess, you patch the prompt, and you hope. Small changes in prompts, context size, or ordering can cause large behavioural shifts. Without history to compare against:

Bugs cannot be reliably reproduced.
Improvements cannot be validated.
Regressions slip back in unnoticed.

Observability turns the subjective “the model feels worse this week” into evidence. It also gives you the raw material for deterministic testing later, because every captured interaction is a future test case.

What to collect

The instinct is to log everything and sort it out later. That works for a week and then becomes unmanageable. The better approach is to think in three layers and capture each one deliberately.

From the system, the input layer

This is everything that went into the model on the way in.

The full input request
Conversation history, with exact ordering preserved
The context assembly logic, including retrieved documents, tools, and any data passed to the model
The system prompt, including hidden or injected instructions
Any other model calls that fed data into the main LLM
Any guardrail outputs or scoring mechanisms involved in the response

Ordering matters here more than people expect. If the context was assembled in a different order this time than last time, that is often the whole story.
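One way to make ordering differences visible is to store a fingerprint of the context in the exact order it was sent. The sketch below is a minimal illustration, assuming context items are JSON-serialisable dictionaries; the function names are hypothetical and not part of any library.

```python
import hashlib
import json


def order_fingerprint(context_items):
    """Hash the context items in the exact order they were sent to the model."""
    # sort_keys only normalises keys inside each item; list order is preserved.
    serialised = json.dumps(context_items, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(serialised.encode("utf-8")).hexdigest()[:12]


def same_items_different_order(a, b):
    """True when two requests carried the same items in a different order."""
    def canonical(items):
        return sorted(json.dumps(item, sort_keys=True) for item in items)

    return canonical(a) == canonical(b) and order_fingerprint(a) != order_fingerprint(b)


last_week = [{"fruit": "Apple"}, {"fruit": "Banana"}]
this_week = [{"fruit": "Banana"}, {"fruit": "Apple"}]
print(same_items_different_order(last_week, this_week))  # True
```

Storing the fingerprint alongside each request makes "same content, different order" a query rather than a manual diff.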

From the LLM, the output layer

This is what the model produced and the conditions under which it produced it.

The raw, unaltered response
Token usage metrics
Log probabilities where available, since they are one of the few real windows into non-determinism
Model configuration: temperature, model version, anything else that could change between calls
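Where log probabilities are available, one common way to summarise them is perplexity, the exponential of the negative mean log probability per token. The sketch below assumes the provider returns one natural-log probability per generated token; check the response format of your own model API before relying on it.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p)) over the generated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Four tokens, each generated with probability 0.5, give a perplexity of about 2.
example = [math.log(0.5)] * 4
print(round(perplexity(example), 6))  # 2.0
```

A higher value means the model was less certain about the tokens it chose, which is a useful signal to store next to the raw response.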

Generated and derived metrics

This is the layer you build on top of the raw capture.

Latency
Tool call success rates
Agentic actions taken during the response
Custom behavioural metrics like refusal rate or verbosity score

The derived layer is where most of the long-term value lives. The raw capture lets you reproduce a single failure. The derived layer lets you see drift across thousands of them.
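As a rough illustration of the derived layer, the sketch below computes a refusal rate and a mean response length over a batch of captured records. The refusal markers are hypothetical placeholders that would need tuning per system, and the record shape is an assumption that mirrors the JSON example later in this article.

```python
# Hypothetical markers; a real system would need its own list or a classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")


def is_refusal(response):
    """Naive substring check for refusal phrasing."""
    text = response.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def summarise(records):
    """Aggregate simple behavioural metrics across captured records."""
    count = len(records)
    if count == 0:
        return {"count": 0, "refusal_rate": 0.0, "mean_completion_tokens": 0.0}
    refusals = sum(is_refusal(r["llm_output"]["raw_response"]) for r in records)
    tokens = [r["llm_output"]["token_metrics"]["completion_tokens"] for r in records]
    return {
        "count": count,
        "refusal_rate": refusals / count,
        "mean_completion_tokens": sum(tokens) / count,
    }


records = [
    {"llm_output": {"raw_response": "I cannot help with that.",
                    "token_metrics": {"completion_tokens": 6}}},
    {"llm_output": {"raw_response": "Try a crisp apple.",
                    "token_metrics": {"completion_tokens": 4}}},
]
print(summarise(records))
# {'count': 2, 'refusal_rate': 0.5, 'mean_completion_tokens': 5.0}
```

Run over a week of records per prompt version, a summary like this is what lets drift show up as a number.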

Guidelines for collecting

Here are a few guidelines on how to collect this data, and why.

Always preserve outputs. Never overwrite or post-process the original response without keeping the original somewhere. The post-processed version is what your users saw. The raw version is what the model actually said.

Capture broadly, prune later. You will not know which signals matter until you have been debugging real failures for a while. Collect more than you think you need, and refine the schema as patterns emerge.

Version everything. Prompts, models, evaluation logic, and metrics should all be versioned. Comparing behaviour across two weeks is worthless if you cannot tell which prompt version produced which response.
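A lightweight way to keep prompt versions honest is to pair the human-readable label with a hash of the prompt text, so an edit that nobody relabelled still shows up as a new version. This is a minimal sketch; the label format and function name are assumptions, not an established convention.

```python
import hashlib


def prompt_version(label, prompt_text):
    """Combine a human label with a short content hash of the prompt text."""
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:8]
    return f"{label}@{digest}"


original = prompt_version("fruit-seller-v3", "You are a helpful fruit seller...")
edited = prompt_version("fruit-seller-v3", "You are a very helpful fruit seller...")
print(original != edited)  # True: same label, different content, different version
```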

Prioritise signal over noise. Capturing too much is its own failure mode. Define your key indicators early and let the rest sit as supporting context, not as primary signal.

Be pragmatic. Optimise for data that supports root cause analysis and validation. If a metric does not feed one of those two things, it is probably curiosity, and curiosity is expensive at scale.

Address cost and compliance up front. Storing every prompt and every response carries real storage cost and real compliance weight, especially around PII and GDPR. Retention policies should be designed in from the start, not bolted on after the first audit.
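The sketch below shows two small building blocks that a retention policy might use: masking email addresses in the copy that goes to long-term storage, and flagging records older than a retention window. The regular expression is deliberately simple and will not catch all PII, and the 30-day default is a placeholder, not a recommendation; actual requirements depend on your own legal and compliance guidance.

```python
import re
from datetime import datetime, timedelta, timezone

# Simplified pattern for illustration; real PII detection needs far more than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text):
    """Replace anything that looks like an email address with a marker."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def is_expired(timestamp, now, retention_days=30):
    """True when a record's ISO 8601 timestamp is older than the retention window."""
    recorded = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return now - recorded > timedelta(days=retention_days)


print(redact_emails("mail me at jo@example.com please"))
# mail me at [REDACTED_EMAIL] please
print(is_expired("2025-12-16T09:30:00Z", datetime(2026, 1, 20, tzinfo=timezone.utc)))
# True
```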

Where to start

Start small. Store the prompt data in a structured format that is easy to retrieve. This can grow into a full tracking pipeline later, or stay as captured data exposed through an API. The shape of the record is the thing that matters most early on, because that shape is what every later tool will consume.

Below is a JSON example. It is human readable, easy to interact with from a test automation or data analysis perspective, and stores cleanly as a structured object in a database.

A few things this example demonstrates:

Version control. system_version and system_prompt_version make the input and logic traceable.
Reproducibility. Every parameter in model_config is captured, so the run can be repeated under the same settings.
Root cause focus. The context_ordering_type field is the kind of small detail that lets you spot positional bias, the same kind covered in the Context Order Bias study.
Structured analysis. The derived_metrics section sits ready for statistical analysis. Test thresholds can be applied directly against it.

{
  "request_id": "rca-fruit-v2-001234",
  "timestamp": "2025-12-16T09:30:00Z",
  "system_version": "fruit-seller-agent-v2.1",

  "model_config": {
    "model_name": "gpt-4.1-mini",
    "temperature": 1.0
  },

  "input_data": {
    "user_query": "hello, I'd like some fruit recommendations",
    "system_prompt_version": "fruit-seller-v3",
    "system_prompt": "You are a helpful fruit seller...",
    "context_received": [
      {"fruit": "Apple", "description_length": 150},
      {"fruit": "Banana", "description_length": 140}
    ],
    "context_ordering_type": "static_alphabetical"
  },

  "llm_output": {
    "raw_response": "Try a crisp apple for a sweet or tart snack, or a juicy orange to refresh and boost your vitamin C!",
    "token_metrics": {
      "prompt_tokens": 580,
      "cached_tokens": 0,
      "total_tokens": 610,
      "completion_tokens": 30
    }
  },

  "derived_metrics": {
    "latency_ms": 450,
    "fruit_mentions": {
      "apple": 1,
      "orange": 1,
      "banana": 0
    },
    "custom_metric_verbosity_score": 30,
    "perplexity": 12.5
  }
}
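To illustrate how test thresholds might be applied to a record like this, the sketch below checks latency, verbosity, and whether the response mentions any fruit that was not in the supplied context. The threshold values and the check_record function are hypothetical, and the record is a trimmed copy of the example above.

```python
def check_record(record, max_latency_ms=1000, max_verbosity=50):
    """Return a list of human-readable failures for one captured record."""
    failures = []
    metrics = record["derived_metrics"]

    if metrics["latency_ms"] > max_latency_ms:
        failures.append(f"latency too high: {metrics['latency_ms']}ms")
    if metrics["custom_metric_verbosity_score"] > max_verbosity:
        failures.append(f"too verbose: {metrics['custom_metric_verbosity_score']}")

    # Flag fruits the response mentioned that were never in the supplied context.
    offered = {item["fruit"].lower() for item in record["input_data"]["context_received"]}
    for fruit, count in metrics["fruit_mentions"].items():
        if count > 0 and fruit not in offered:
            failures.append(f"mentioned fruit not in context: {fruit}")
    return failures


record = {
    "input_data": {"context_received": [{"fruit": "Apple"}, {"fruit": "Banana"}]},
    "derived_metrics": {
        "latency_ms": 450,
        "fruit_mentions": {"apple": 1, "orange": 1, "banana": 0},
        "custom_metric_verbosity_score": 30,
    },
}
print(check_record(record))  # ['mentioned fruit not in context: orange']
```

In this record the response recommends an orange even though only Apple and Banana were supplied as context, which is exactly the kind of discrepancy a structured record makes easy to catch.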

Final thoughts

Back to the unsolvable bug. With observability in place, you have somewhere to start looking. You pull the failing request by ID and see the exact context the model was handed, the exact prompt version it was running under, the exact response it produced, and the derived metrics for that turn. You compare it against a working interaction from the same day and see that the context supplied to the prompt looks odd. You pull the full stored prompt, re-run it 20 times, and find that the bug occurs in 5 of the 20 runs. You raise a ticket, state the issue, and give the developers the exact prompt and the number of runs it takes to trigger the issue, so they can reproduce it and test the fix.
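The re-run step can be sketched as a small harness that replays a stored prompt a fixed number of times and counts how often the bug appears. Here call_model and is_bug are hypothetical callables you would supply, and fake_model is a stand-in that cycles through canned responses so the example runs without a real model.

```python
import itertools


def reproduce(call_model, prompt, is_bug, runs=20):
    """Replay a stored prompt and report how often the failure condition occurs."""
    failures = sum(1 for _ in range(runs) if is_bug(call_model(prompt)))
    return {"runs": runs, "failures": failures, "failure_rate": failures / runs}


# Stand-in for a real model call: every fourth response contains the bug.
_canned = itertools.cycle(["apple", "banana", "apple", "orange"])


def fake_model(prompt):
    return next(_canned)


result = reproduce(fake_model, "stored prompt text", lambda response: "orange" in response)
print(result)  # {'runs': 20, 'failures': 5, 'failure_rate': 0.25}
```

A failure rate attached to the ticket tells the developers how many runs they need before concluding that a fix has worked.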

That is what observability buys you. Not certainty, not a fixed system, just the ability to do the next investigation from evidence instead of from guesswork.
