What is it
Observability is the practice of collecting input, output, and system data to help with root cause analysis and behavioral debugging. While this is typically done through logging in traditional systems, in LLM systems, logging raw outputs can quickly lead to data bloat, obscuring critical signals. Therefore, it is important to capture model responses and metadata as structured data in specialised storage.
Why it matters
How can we diagnose and fix behavioral problems from a black box?
Large Language Models (LLMs) are often treated as black boxes. Their outputs are non-deterministic, and failures are frequently hard to reproduce. When an LLM behaves unexpectedly, it is hard to track down the source of the issue:
- Was the issue caused by the prompt or the context?
- Was it the surrounding system logic (e.g., retrieval, tools)?
- Was it model randomness (a key aspect of non-determinism)?
Without sufficient observability, diagnosing and fixing behavioral issues becomes guesswork and trial-and-error. Small changes in prompts, context size, or ordering can lead to large behavioral shifts. Without historical, structured data:
- Bugs cannot be reliably reproduced.
- Improvements cannot be validated.
- Regressions (unintended re-emergence of old bugs) go unnoticed.
Observability turns the subjective assessment of “the model feels better/worse” into an evidence-based diagnosis and provides real data for deterministic testing.
Strategy Summary
While it is impossible to fully inspect most LLMs' internal weights or reasoning, the system can be effectively turned into a white box at the interface level by collecting and retaining input, output, system, and decision-level data.
By capturing this structured data, we enable:
- Root cause analysis of failures
- Behavioral debugging
- Regression detection
- More objective system evaluation over time
- Evaluation repeatability
What Should Be Collected
From the System (The Input Layer)
- Full input requests
- Conversation history (exact ordering matters)
- Context assembly logic (retrieved documents, tools, and the data passed to the model)
- System prompt (including hidden or injected instructions)
- Any other calls to LLMs or models used to provide data to the main LLM
- Any guardrailing outputs or scoring mechanisms used in the system
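As a sketch of what capturing the input layer might look like in practice (the field names here are illustrative assumptions, not a prescribed schema), each request can be snapshotted as one structured record:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class InputRecord:
    """Everything the model received for one request, captured verbatim."""
    user_query: str
    conversation_history: list[dict[str, str]]  # exact ordering preserved
    system_prompt: str                          # including injected instructions
    system_prompt_version: str
    retrieved_context: list[dict[str, Any]] = field(default_factory=list)
    auxiliary_llm_calls: list[dict[str, Any]] = field(default_factory=list)
    guardrail_outputs: list[dict[str, Any]] = field(default_factory=list)

record = InputRecord(
    user_query="hello, I'd like some fruit recommendations",
    conversation_history=[
        {"role": "user", "content": "hello, I'd like some fruit recommendations"}
    ],
    system_prompt="You are a helpful fruit seller...",
    system_prompt_version="fruit-seller-v3",
)
```

Keeping the context and guardrail outputs alongside the query means a failure can later be traced to the exact data the model saw, not a reconstruction of it.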
From the LLM (The Output Layer)
- Raw, unaltered model outputs
- Token usage metrics
- Log probabilities (when available, crucial for understanding non-determinism)
- Model configuration (temperature, model version)
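The output layer can be captured in the same style. In this sketch (names are again illustrative), the token total is derived rather than trusted from upstream, so inconsistencies in provider-reported counts surface early:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OutputRecord:
    """The model's response, stored raw before any post-processing."""
    raw_response: str
    model_name: str
    temperature: float
    prompt_tokens: int
    response_tokens: int
    logprobs: Optional[list[float]] = None  # only when the provider exposes them

    @property
    def total_tokens(self) -> int:
        # Derive the total instead of trusting an upstream field.
        return self.prompt_tokens + self.response_tokens

out = OutputRecord(
    raw_response="Try a crisp apple for a sweet or tart snack...",
    model_name="gpt-4.1-mini",
    temperature=1.0,
    prompt_tokens=580,
    response_tokens=30,
)
```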
Generated / Derived Metrics
- Latency
- Tool-call success rates
- Agentic actions that might have been carried out
- Custom behavioral metrics (e.g., refusal rate, verbosity score)
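Because the raw response is preserved, derived metrics can be computed after the fact and recomputed as definitions evolve. A minimal sketch, where the metric definitions (word-count verbosity, a phrase-based refusal check) are illustrative stand-ins for real KPIs:

```python
def derive_metrics(raw_response: str, fruits: list[str]) -> dict:
    """Compute simple behavioral metrics from a preserved raw response."""
    text = raw_response.lower()
    return {
        "fruit_mentions": {f: text.count(f.lower()) for f in fruits},
        "verbosity_score": len(raw_response.split()),  # word count as a crude proxy
        "refused": any(p in text for p in ("i can't", "i cannot", "i'm unable")),
    }

metrics = derive_metrics(
    "Try a crisp apple for a sweet or tart snack, or a juicy orange "
    "to refresh and boost your vitamin C!",
    ["Apple", "Banana", "Orange"],
)
```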
Approach to Collecting Data (Guidelines)
- Always Preserve Outputs: Never overwrite or post-process the original response without first preserving it.
- Capture Metrics by Default, Prune Later: Collect broadly at first. Over time, refine which signals are actually useful.
- Version Everything: Prompts, models, evaluation logic, and metrics should all be versioned to support accurate comparisons.
- Prioritise Signal Over Noise: Excessive metrics can obscure the signals that matter most. Define your key performance indicators (KPIs) early.
- Be Pragmatic: Optimise for data that supports root cause analysis and validation, not just curiosity.
- Address Cost and Compliance: Logging everything carries significant storage cost and compliance implications (especially concerning PII/GDPR). Design your storage and retention policies upfront.
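As an illustration of the last point, redaction can be slotted in just before a record is persisted. This sketch only masks email addresses with a simple pattern and is not a substitute for a real compliance policy:

```python
import re

# Illustrative only: mask obvious PII (here, email addresses) before storage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text: str) -> str:
    """Replace email addresses with a placeholder before a record is logged."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

safe = redact("Contact me at jane.doe@example.com for the order")
```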
Where to Start
Start small by storing prompt and response data in a structured format for easy retrieval. This can later grow into a full tracking pipeline (an automated system for capturing and processing data) or simply remain stored data retrievable via an API.
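A minimal sketch of such a starting point, appending each interaction as one JSON line (the file name and field set here are assumptions):

```python
import json
import time
import uuid
from pathlib import Path

LOG_PATH = Path("llm_observations.jsonl")  # storage location is an assumption

def log_interaction(prompt: str, raw_response: str, model_name: str) -> dict:
    """Append one structured observation as a JSON line and return it."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "prompt": prompt,
        "raw_response": raw_response,  # preserved unaltered
        "model_name": model_name,
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = log_interaction(
    "hello, I'd like some fruit recommendations",
    "Try a crisp apple for a sweet or tart snack...",
    "gpt-4.1-mini",
)
```

Append-only JSON lines keep writes cheap and make the log trivially loadable into analysis tooling later.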
Below is a JSON-based example of a logged output. JSON records are human readable, easy to work with from a test automation or data analysis perspective, and can be stored as structured objects in a database.
Key benefits this JSON demonstrates:
- Version Control: The system_version and system_prompt_version ensure the input and logic are traceable.
- Reproducibility: All parameters in model_config are captured.
- Root Cause Analysis Focus: The context_ordering_type field is critical for diagnosing positional bias, where the model favours items that appear earlier in the context.
- Structured Analysis: Sections such as derived_metrics (and evaluation_labels, where your system produces them) are clean targets for statistical analysis, and test thresholds can be applied to them directly.
```json
{
  "request_id": "rca-fruit-v2-001234",
  "timestamp": "2025-12-16T09:30:00Z",
  "system_version": "fruit-seller-agent-v2.1",
  "model_config": {
    "model_name": "gpt-4.1-mini",
    "temperature": 1.0
  },
  "input_data": {
    "user_query": "hello, I'd like some fruit recommendations",
    "system_prompt_version": "fruit-seller-v3",
    "system_prompt": "You are a helpful fruit seller...",
    "context_received": [
      {"fruit": "Apple", "description_length": 150},
      {"fruit": "Banana", "description_length": 140}
    ],
    "context_ordering_type": "static_alphabetical"
  },
  "llm_output": {
    "raw_response": "Try a crisp apple for a sweet or tart snack, or a juicy orange to refresh and boost your vitamin C!",
    "token_metrics": {
      "prompt_tokens": 580,
      "cached_tokens": 0,
      "response_tokens": 30,
      "total_tokens": 610
    }
  },
  "derived_metrics": {
    "latency_ms": 450,
    "fruit_mentions": {
      "apple": 1,
      "orange": 1,
      "banana": 0
    },
    "custom_metric_verbosity_score": 30,
    "perplexity": 12.5
  }
}
```
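Once records like this are stored, test thresholds can be applied to the derived_metrics section directly. A sketch, using a trimmed copy of the record above and purely illustrative threshold values:

```python
import json

# A trimmed copy of the record above, as it might come back from storage.
record = json.loads("""
{
  "derived_metrics": {
    "latency_ms": 450,
    "fruit_mentions": {"apple": 1, "orange": 1, "banana": 0},
    "custom_metric_verbosity_score": 30
  }
}
""")

def check_thresholds(metrics: dict) -> list[str]:
    """Return the list of failed checks; an empty list means the record passes."""
    failures = []
    if metrics["latency_ms"] > 2000:  # threshold values are illustrative
        failures.append("latency too high")
    if metrics["custom_metric_verbosity_score"] > 200:
        failures.append("response too verbose")
    if sum(metrics["fruit_mentions"].values()) == 0:
        failures.append("no fruit recommended")
    return failures

failures = check_thresholds(record["derived_metrics"])
```

Running the same checks over historical records is what turns "the model feels worse" into a regression report.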