Is There a Role for AI Quality Engineers?, How To Test AI

AI is going to be more and more integrated into the fabric of all software applications. Some applications will let users interface with the model directly. Some will use AI to make orchestration decisions. Some will use AI to augment existing human processes. The shape varies. The fact that AI is in the stack is increasingly fixed.

Which raises the question: how will this affect the existing software world, and what types of engineering roles might be needed to support these systems? In the specialised AI world, what was once the domain of the Data Scientist and Machine Learning Engineer is already expanding to take in Backend Engineers, Data Engineers, and Quality Engineers. New roles have already been defined and adopted, the most visible of which is the AI Engineer, a role specialised in working with and deploying AI-based products.

This article looks at one such role, asks whether it is sustainable as a single job, and considers where Quality Engineering fits in the answer.

An example AI Engineer job

The job description below is taken from a recent listing. Specific company details have been removed, and it is used here as an example only.

Responsibilities:

Develop prompt templates and apply retrieval strategies to enhance answer precision and factual accuracy.
Engage in error analysis, iterative model refinement, and performance optimization.
Responsible for building, fine-tuning, and evaluating models powered by LLMs, using frameworks such as AWS Bedrock, HuggingFace, or the OpenAI API.
Work closely with senior engineers and product managers to translate user needs into technical specifications and features.
Contribute to the creation of data pipelines for training and testing, including necessary annotation and evaluation tools.
Maintain clear documentation for code and workflows, adhering to best practices for code quality and reproducibility.

Breaking it down

The spec covers a lot of different specialisms, and it is much broader than a traditional software engineering role. If the traditional roles that map onto each responsibility were added to the profile, it would look something like this:

Build, fine-tune, and evaluate models. (Machine Learning Engineer)
Create and optimise prompts. (Data Scientist, Machine Learning Engineer, Linguist)
Work with RAG and context retrieval strategies. (Data Scientist, Data Engineer)
Create data pipelines. (Data Engineer)
Create technical specifications for features. (Systems Engineer)
Analyse errors and optimise model and system performance. (Quality Engineer, Backend Developer, Data Scientist, Machine Learning Engineer)
Write and maintain good code and documentation. (Backend Developer, Quality Engineer)

That is a lot of generalism in one job. It is also a lot of unique specialisms that engineers and scientists spend whole careers developing. Code quality alone is a discipline. Context retrieval is an expanding and increasingly complex field of its own. Asking one role to cover all of it is asking for someone who is competent everywhere, expert nowhere, and accountable for results in domains that take years to learn properly.

The rise of hybrid specialists

The reasonable inference is that one of two things will happen. Either AI Engineers will develop personal specialisms within the broad role, or, more likely, the industry will define official specialisms as hybrids of existing roles. Something along these lines:

AI Backend Engineer. Responsible for code quality and complex backend problems on AI systems.
AI Data Engineer. Responsible for data pipelines, databases, and context retrieval.
AI Model Engineer. Responsible for fine-tuning and model optimisation.
AI Quality Engineer. Responsible for evaluation and performance of AI systems.

New role types are also likely to emerge that have no obvious traditional counterpart. Conversation Engineer. AI Evaluation Engineer. Prompt Architect. The names will settle over time, but the specialisms underneath them are already real.

Defining the AI Quality Engineer

Since the focus of this site is testing and quality, let’s sketch out what an AI Quality Engineer role might actually look like and what the responsibilities could be.

Owning the testing strategy for evaluating LLM-driven applications, using or building evaluation frameworks.
Identifying and managing test data sets to evaluate changes in models, prompts, or context.
Contributing critical analysis of design choices, prompting, and context retrieval.
Designing and implementing test techniques for exploratory and regression testing of LLM-based systems.
Carrying out root cause analysis and risk analysis on identified issues.
Building benchmarks and evals.
Leading human-in-the-loop campaigns for labelling and root cause analysis tasks.

That list is recognisable as quality engineering, applied to a different kind of system. The skills transfer. The mindset transfers. What changes is the system being tested, which is non-deterministic and statistical, and which rarely fails in ways that produce a clean stack trace. That changes the techniques without changing the discipline.
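To make the shift in technique concrete, here is a minimal sketch of a regression check for a non-deterministic system. Instead of asserting on a single output, it calls the model several times per test case and compares the pass rate against a threshold. Everything in it is an assumption made for illustration: the names EvalCase, pass_rate, failing_cases, and make_fake_model are hypothetical, the substring check stands in for whatever scoring a real evaluation would use, and the fake model stands in for a real LLM call.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    prompt: str
    must_contain: str  # hypothetical pass criterion: a simple substring check


def pass_rate(model: Callable[[str], str], case: EvalCase, runs: int = 20) -> float:
    """Call the model `runs` times on one case and return the fraction that pass."""
    passes = sum(
        case.must_contain.lower() in model(case.prompt).lower()
        for _ in range(runs)
    )
    return passes / runs


def failing_cases(
    model: Callable[[str], str],
    cases: List[EvalCase],
    threshold: float = 0.9,
    runs: int = 20,
) -> List[Tuple[str, float]]:
    """Return (prompt, pass rate) for every case whose pass rate is below the threshold."""
    results = [(case.prompt, pass_rate(model, case, runs)) for case in cases]
    return [(prompt, rate) for prompt, rate in results if rate < threshold]


def make_fake_model(error_rate: float, seed: int = 0) -> Callable[[str], str]:
    """Stand-in for a real LLM call: gives a wrong answer with probability `error_rate`."""
    rng = random.Random(seed)
    answers = {"Capital of France?": "Paris", "2 + 2?": "4"}

    def model(prompt: str) -> str:
        if rng.random() < error_rate:
            return "I am not sure."
        return f"The answer is {answers[prompt]}."

    return model


if __name__ == "__main__":
    cases = [EvalCase("Capital of France?", "Paris"), EvalCase("2 + 2?", "4")]
    flaky_model = make_fake_model(error_rate=0.3)
    for prompt, rate in failing_cases(flaky_model, cases):
        print(f"Below threshold: {prompt!r} passed {rate:.0%} of runs")
```

The threshold and the number of runs are the interesting design choices here. A single passing run says very little about a statistical system, so the check is on a rate rather than on one result, and what counts as an acceptable rate is a risk decision rather than a purely technical one.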

Generalist or specialist

The honest counter-argument is that a generalist may be good enough. Will engineers be so heavily augmented by AI tooling that a generalist can apply specialist techniques under AI guidance? It is a fair question.

The answer probably depends on the size of the project. Small, simple projects are often built and run by generalist developers, and that has always worked. As with anything in engineering, when the design scales and the complexity grows, the need for specialism grows alongside it. Performance is easy to test on a system with one service doing one task. It is far harder when there are a hundred services working together with large volumes of data.

As the infrastructure supporting AI grows in scale and complexity, the specialisms have to grow with it. The job spec at the top of this article is what it looks like when an industry is still pretending that one person can cover the whole stack. The hybrid roles are what emerge once that pretence stops being practical.

That is where the AI Quality Engineer role comes in. The discipline already exists. The system being tested is new. Someone has to own the question of whether it works, and right now that question is sitting at the bottom of a job description that already asks for too much.
