Observability

Turning the LLM black box into a white box. Observability is the practice of collecting input, output, and system data to enable root cause analysis and behavioral debugging.

Dec 2025 8 min

What is it

Observability is the practice of collecting input, output, and system data to support root cause analysis and behavioral debugging. While traditional systems typically achieve this through logging, in LLM systems logging raw outputs can quickly lead to data bloat, obscuring critical signals. It is therefore important to capture model responses and metadata as structured data in specialized storage.
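As a minimal sketch of what "structured data" can mean in practice, one record per model call might be modeled as a dataclass and serialized to JSON for storage. The schema and field names here are illustrative assumptions, not taken from any particular tool:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class LLMCallRecord:
    """One structured observability record per model call (illustrative schema)."""
    request_id: str
    model_name: str
    prompt: str
    response: str
    prompt_tokens: int
    response_tokens: int
    latency_ms: float
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # Serialize to JSON so the record can be stored and queried later,
        # instead of logging the raw response as unstructured text.
        return json.dumps(asdict(self))

record = LLMCallRecord(
    request_id="rca-fruit-v2-001234",
    model_name="gpt-4.1-mini",
    prompt="hello, I'd like some fruit recommendations",
    response="Try a crisp apple...",
    prompt_tokens=580,
    response_tokens=30,
    latency_ms=450,
)
print(record.to_json())
```

Because each record is a self-contained object rather than a log line, it can be indexed by request ID and queried later by any downstream analysis.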

Why it matters

How can we diagnose and fix behavioral problems from a black box?

Large Language Models (LLMs) are often treated as black boxes. Their outputs are non-deterministic, and failures are frequently hard to reproduce. When an LLM behaves unexpectedly, it is hard to track down the source of the issue.

Without sufficient observability, diagnosing and fixing behavioral issues becomes guesswork and trial-and-error. Small changes in prompts, context size, or ordering can lead to large behavioral shifts, and without historical, structured data those shifts cannot be traced back to a specific cause or reproduced for debugging.

Observability turns the subjective assessment of “the model feels better/worse” into an evidence-based diagnosis and provides real data for deterministic testing.
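To illustrate how stored records support deterministic testing, here is a hedged sketch: the record shape loosely mirrors the JSON example later in this article, and the expectations (a hypothetical `check_record` helper) are made up for illustration. Instead of re-calling the model, we assert properties of an output that was already recorded:

```python
# A stored observability record captured on a previous run (illustrative data).
stored = {
    "request_id": "rca-fruit-v2-001234",
    "input_data": {"user_query": "hello, I'd like some fruit recommendations"},
    "llm_output": {"raw_response": "Try a crisp apple for a sweet or tart snack..."},
    "derived_metrics": {"latency_ms": 450},
}

def check_record(record: dict) -> list[str]:
    """Return a list of failed expectations for one stored record."""
    failures = []
    text = record["llm_output"]["raw_response"].lower()
    # Deterministic checks run against recorded data, not a live model call.
    if "fruit" in record["input_data"]["user_query"] and "apple" not in text:
        failures.append("expected at least one fruit recommendation")
    if record["derived_metrics"]["latency_ms"] > 2000:
        failures.append("latency exceeded 2s budget")
    return failures

print(check_record(stored))  # an empty list means all expectations passed
```

Running such checks over the historical record set turns "the model feels worse" into a concrete, repeatable diff between runs.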

Strategy Summary

While it is impossible to fully inspect most LLMs' internal weights or reasoning, the system can be effectively turned into a white box at the interface level by collecting and retaining input, output, system, and decision-level data.

By capturing this structured data, we enable root cause analysis, regression testing against real historical examples, and evidence-based behavioral debugging.

What Should Be Collected

From the System (The Input Layer)

From the LLM (The Output Layer)

Generated / Derived Metrics

Approach to Collecting Data (Guidelines)

Where to Start

Start small by storing the prompt data in a structured format for easy retrieval. This can later grow into a full tracking pipeline (an automated system for capturing and processing data), or simply remain as data that can be retrieved via an API.
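A minimal sketch of this "start small" approach, assuming SQLite as the structured store; the table schema and the `store_call` / `fetch_call` helper names are illustrative, not from any specific framework:

```python
import json
import sqlite3

# Persist records as JSON rows in SQLite so they can be retrieved later,
# e.g. behind a small internal API. Schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llm_calls (request_id TEXT PRIMARY KEY, ts TEXT, payload TEXT)"
)

def store_call(request_id: str, ts: str, payload: dict) -> None:
    conn.execute(
        "INSERT INTO llm_calls VALUES (?, ?, ?)",
        (request_id, ts, json.dumps(payload)),
    )

def fetch_call(request_id: str) -> dict:
    row = conn.execute(
        "SELECT payload FROM llm_calls WHERE request_id = ?", (request_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {}

store_call(
    "rca-fruit-v2-001234",
    "2025-12-16T09:30:00Z",
    {"user_query": "hello, I'd like some fruit recommendations"},
)
print(fetch_call("rca-fruit-v2-001234")["user_query"])
```

Even this small setup gives per-request retrieval by ID, which is the foundation the later analysis and testing steps build on.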

Below is a JSON-based example of an output record. JSON records are human readable, easy to work with from a test-automation or data-analysis perspective, and can be stored as structured objects in a database.

The example demonstrates the key layers to capture: model configuration, the input layer (query, prompt version, context), the raw output with token metrics, and derived metrics.

{
  "request_id": "rca-fruit-v2-001234",
  "timestamp": "2025-12-16T09:30:00Z",
  "system_version": "fruit-seller-agent-v2.1",

  "model_config": {
    "model_name": "gpt-4.1-mini",
    "temperature": 1.0
  },

  "input_data": {
    "user_query": "hello, I'd like some fruit recommendations",
    "system_prompt_version": "fruit-seller-v3",
    "system_prompt": "You are a helpful fruit seller...",
    "context_received": [
      {"fruit": "Apple", "description_length": 150},
      {"fruit": "Banana", "description_length": 140}
    ],
    "context_ordering_type": "static_alphabetical"
  },

  "llm_output": {
    "raw_response": "Try a crisp apple for a sweet or tart snack, or a juicy orange to refresh and boost your vitamin C!",
    "token_metrics": {
      "prompt_tokens": 580,
      "cached_tokens": 0,
      "response_tokens": 30,
      "total_tokens": 610
    }
  },

  "derived_metrics": {
    "latency_ms": 450,
    "fruit_mentions": {
      "apple": 1,
      "orange": 1,
      "banana": 0
    },
    "custom_metric_verbosity_score": 30,
    "perplexity": 12.5
  }
}
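As a sketch of how the derived_metrics above might be produced from the raw response, here is one possible implementation. The word-based verbosity score is an assumption for illustration; real systems may define verbosity (and metrics like perplexity) differently:

```python
import re

raw_response = (
    "Try a crisp apple for a sweet or tart snack, "
    "or a juicy orange to refresh and boost your vitamin C!"
)

def derive_metrics(text: str, fruits: list[str]) -> dict:
    """Compute simple derived metrics from a raw model response."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        # Count how often each known fruit is mentioned in the response.
        "fruit_mentions": {f: words.count(f) for f in fruits},
        # Illustrative verbosity score: plain word count of the response.
        "custom_metric_verbosity_score": len(words),
    }

print(derive_metrics(raw_response, ["apple", "orange", "banana"]))
```

Because these metrics are derived after the fact from stored responses, new metrics can be backfilled over the entire history without re-running the model.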