Honest writing about testing AI, from someone who does it.
This blog exists because I kept running into the same problem: most of what's written about AI evaluation is either too theoretical to apply or too shallow to trust.
I write about what actually works when you're trying to evaluate AI systems in production: the methods, the trade-offs, and the stuff that doesn't fit neatly into a benchmark score. That means behavioural testing, risk framing, evaluation strategy, and the gaps between what models promise and what they deliver.
Nothing here is intended as a one-size-fits-all answer. The strategies and frameworks I share are meant to be adapted, challenged, and extended, not followed blindly.
If there's a thread that runs through all of it, it's this: risk in AI systems can't be eliminated. It can only be understood, measured, and managed. I'd rather write honestly about that reality than pretend it doesn't exist.
Where this comes from
The ideas I write about here didn't come from textbooks. They came from building and operating a real AI product in production, where non-determinism, emergent behaviour, and incomplete observability were daily realities, not academic talking points.
Traditional pass/fail testing kept falling short. Evaluations had to account for uncertainty, statistical risk, shifting data distributions, and the fact that some failures were simply inevitable. Most of what I publish here was developed and stress-tested under those conditions.
I can't reference the specific product or organisation, but the challenges of production risk, user impact, model drift, and system-level behaviour are ones that most teams shipping AI will recognise.
Who writes this
James Pearce
howtotestai.com
I started this blog to write about the gap between how we talk about AI evaluation and how it actually works in practice.
I've spent over 15 years in software testing and quality engineering, with the last several focused almost entirely on evaluating complex AI and ML-driven products in production.
That's meant building evaluation strategies, tooling, and processes to assess system behaviour, manage risk, and support decisions under genuine uncertainty, balancing statistical confidence against cost, user impact, and organisational risk.
I believe that objective evaluation, clear risk framing, and honest acknowledgement of what we don't know are non-negotiable if we want trustworthy AI systems. That's what I try to write about here.
What I hold myself to
A few commitments that guide what I publish and how I publish it.
Integrity & Objectivity
Independence. All research and guidance are produced without vendor influence or commercial pressure.
Full transparency. Any vested interests or tooling relationships are clearly stated.
Clear labelling. I distinguish between evidence-based guidance and opinion or perspective.
Funding model. This is a self-funded initiative. Affiliate links, where used, are disclosed and never influence conclusions.
Content & Quality
Human-first content. All analysis and writing is human-authored.
Accuracy over completeness. Content prioritises correctness and applicability over breadth.
A living document. Methods and guidance evolve as the field changes.
Community & Feedback
Accessibility. Concepts are explained clearly, without assuming perfect systems or outcomes.
Open to correction. Errors and disagreements are expected and welcomed.
Privacy. Personal data is respected and never sold.
A note on using this work: The methods and concepts presented here do not eliminate risk. They are intended to help teams understand, prioritise, and manage it. Any reuse of this work should include clear attribution.