Honest writing about testing AI, from someone who does it.
This blog exists because I kept running into the same problem: most of what's written about AI evaluation is either too theoretical to apply or too shallow to trust.
I write about what actually works when you're trying to evaluate AI systems in production: the methods, the trade-offs, and the stuff that doesn't fit neatly into a benchmark score. That means behavioural testing, risk framing, evaluation strategy, and the gaps between what models promise and what they deliver.
Nothing here is intended as a one-size-fits-all answer. The strategies and frameworks I share are meant to be adapted, challenged, and extended, not followed blindly.
If there's a thread that runs through all of it, it's this: risk in AI systems can't be eliminated. It can only be understood, measured, and managed. I'd rather write honestly about that reality than pretend it doesn't exist.
Where this comes from
The ideas I write about here didn't come from textbooks. They came from building and operating a real AI product in production, where non-determinism, emergent behaviour, and incomplete observability were daily realities, not academic talking points.
Traditional pass/fail testing kept falling short. Evaluations had to account for uncertainty, statistical risk, shifting data distributions, and the fact that some failures were simply inevitable. Most of what I publish here was developed and stress-tested under those conditions.
I can't reference the specific product or organisation, but the challenges of production risk, user impact, model drift, and system-level behaviour are ones that most teams shipping AI will recognise.
Who writes this
James Pearce
howtotestai.com
I started this blog to write about the gap between how we talk about AI evaluation and how it actually works in practice.
I've spent over 15 years in software testing and quality engineering, with the last several focused almost entirely on evaluating complex AI and ML-driven products in production.
That's meant building evaluation strategies, tooling, and processes to assess system behaviour, manage risk, and support decisions under genuine uncertainty, balancing statistical confidence against cost, user impact, and organisational risk.
I believe that objective evaluation, clear risk framing, and honest acknowledgement of what we don't know are non-negotiable if we want trustworthy AI systems. That's what I try to write about here.
What I hold myself to
A few commitments that guide what I publish and how I publish it.
Integrity & Objectivity
Independence. All research and guidance are produced without vendor influence or commercial pressure.
Full transparency. Any vested interests or tooling relationships are clearly stated.
Clear labelling. I distinguish between evidence-based guidance and opinion or perspective.
Funding model. This is a self-funded initiative. Affiliate links, where used, are disclosed and never influence conclusions.
Content & Quality
Human-first content. All analysis and writing is human-authored.
Accuracy over completeness. Content prioritises correctness and applicability over breadth.
A living document. Methods and guidance evolve as the field changes.
Community & Feedback
Accessibility. Concepts are explained clearly, without assuming perfect systems or outcomes.
Open to correction. Errors and disagreements are expected and welcomed.
Privacy. Personal data is respected and never sold.
A note on using this work: The methods and concepts presented here do not eliminate risk. They are intended to help teams understand, prioritise, and manage it. Any reuse of this work should include clear attribution.