Context Order Bias: Why This LLM Only Sold Apples and Oranges, How To Test AI

The two fruits that never sold

In this example we have created a simple LLM driven fruit selling app with five products to sell fruit for a local shop. The chatbot has been told to recommend fruit when asked. Apple, Banana, Orange, Strawberry, Watermelon. The system prompt is clean and straight forward. The behavioural rules are sensible. The kind of setup that looks right on first inspection..

We ran it 10 times with the same user input, “hello, I’d like some fruit recommendations”, and counted which fruits appeared in the responses.

However: Strawberries and watermelons were never recommended. Not once across 10 conversations. Two products in the catalogue may as well not have existed.

The temptation in this situation is to modify the the prompt, ship the fix, and move on. but if we apply some testing and investigate more carefully we can uncover a really interesting side effect of supplying context to an LLM.

The setup

The application used a standard chat completion pattern. A system prompt with product descriptions, behavioural rules, and a user message:

system_prompt_input = '''
You are a helpful assistant that always provides concise
and relevant answers to user queries. You are to try to
sell the user fruit.

Below is your context:
Apple
Description: A crisp and juicy fruit with a rounded shape...

Banana
Description: This is a long, curved fruit with a soft,
creamy white flesh...

Orange
Description: A round citrus fruit known for its tough,
bright orange skin...

Strawberry
Description: A small, heart-shaped fruit that is bright
red when ripe...

Watermelon
Description: A very large, heavy fruit with a thick,
striped green rind...

Rules:
1. Always stay on topic.
2. Avoid unnecessary information.
3. Provide clear and direct responses only.
4. Reply in no more than a sentence.
'''

Typical output looked like this:

Try a crisp apple for a versatile snack, a sweet banana
for energy on the go, or a juicy orange for a refreshing
vitamin C boost!

Every response followed the same pattern. Apples first. Sometimes bananas. Always oranges. Never strawberries. Never watermelons.

The knee-jerk fix, and why it didn’t work

Without root cause analysis, the obvious move is to add an instruction:

“All fruits should be represented equally in sales, including Apples, Oranges, Bananas, Strawberries and Watermelon.”

After this change:

Apple: 10, Banana: 3, Orange: 10, Strawberry: 3, Watermelon: 4

Better, but still heavily skewed. Apples and oranges still dominate. The prompt change masked the symptom without addressing the cause.

At this point, it would be tempting to ship the fix and move on. The numbers improved, the worst case is gone, the tickets close. But this hasn’t uncovered why the bias exists. That means it will resurface, in a different form, the moment the product catalogue changes.

Applying root cause analysis

The process follows six steps: observability (capture inputs and outputs), categorisation (label the failures), pattern identification (find clusters), investigation (analyse structural causes), hypothesis testing (isolate variables systematically), and reporting (document findings for future reference).

Observability

We decomposed the prompt into its structural parts:

Instruction: “You are a helpful assistant… try to sell the user fruit.”
Context: Five product descriptions, ordered Apple → Banana → Orange → Strawberry → Watermelon
Rules: Stay on topic, avoid unnecessary info, clear responses only, reply in no more than a sentence

Categorisation

The failure counts gave us clear labels. Apple (10), Orange (10), Banana (7), Strawberry (0), Watermelon (0). A clean split between “always mentioned” and “never mentioned.”

Pattern identification

The cluster was obvious. The LLM heavily favoured fruits listed earlier in the context. The ordering in the prompt, Apple, Banana, Orange, Strawberry, Watermelon, correlated almost exactly with mention frequency.

Investigation

Two structural elements in the prompt could explain this:

Context ordering. Fruits listed earlier get more attention.
Output length restriction. “Reply in no more than a sentence” forces the model to pick a subset, and it picks from the top.

Hypothesis testing

We built a counting mechanism and increased test iterations to 50 for statistical significance:

fruit_count = {
    "apple": 0, "banana": 0, "orange": 0,
    "strawberry": 0, "watermelon": 0,
}

for response in list_of_responses:
    response_lower = response.lower()
    for fruit_name, root in fruit_search_roots.items():
        if root in response_lower:
            fruit_count[fruit_name] += 1

Then tested each hypothesis in isolation.

Hypothesis 1: Context order bias

Test: Reverse the context order to Watermelon → Strawberry → Orange → Banana → Apple.

50 iterations, reversed context order

Fruit	Original order	Reversed order
Apple	50 (always)	24
Banana	35	0
Orange	50 (always)	13
Strawberry	0 (never)	49
Watermelon	0 (never)	48

Reversing the order flipped the bias almost entirely. Strawberry and watermelon, previously invisible, now dominated. But 24 apples persisted despite being listed last, hinting at a residual bias worth investigating further.

Hypothesis 2: Output length restriction

Test: Remove the rule “reply in no more than a sentence.” Keep original context order.

Apple: 50, Banana: 50, Orange: 49, Strawberry: 46, Watermelon: 49

With the length restriction removed, all five fruits were recommended roughly equally. The model wanted to mention all products. The sentence limit was forcing it to truncate, and truncation favoured items listed first.

This works, but at the cost of longer responses and more tokens. Any fix using this approach needs to weigh verbosity against the business requirement for concise recommendations.

Hypotheses 3 and 4: Isolating apple bias vs. positional dominance

The residual 24 apples from Hypothesis 1 raised a question. Does the model have an inherent preference for apples, perhaps inherited from training data, or is position always the dominant factor?

Hypothesis 3: Ask the model to pick a random fruit from the list Apple, Banana, Orange, Strawberry, Watermelon.

Apple: 0, Banana: 5, Orange: 12, Strawberry: 20, Watermelon: 13

Zero apples, despite being listed first. When explicitly asked to be random, the model avoided the first item. Interesting, and contradictory to the earlier results.

Hypothesis 4: Same test, reversed list, Watermelon, Strawberry, Orange, Banana, Apple.

Apple: 31, Banana: 3, Orange: 15, Strawberry: 1, Watermelon: 0

The last-listed fruit (now Apple) dominated. This confirms positional bias is the primary driver, and the specific fruit doesn’t matter. In the “random pick” framing, the model favoured items at the end of the list rather than the beginning, which is a different but related positional effect.

Root cause report

The knee-jerk fix was a fragile prompt tweak that only masked the symptom. Methodical Root Cause Aanlysis uncovered the real problem, enabling a permanent, robust fix.

Observed behaviour

The LLM consistently recommended apples, oranges, and bananas while ignoring strawberries and watermelons entirely.

Root causes identified

Two interacting biases.

Context order bias. The model strongly favours items listed earlier in the context when generating recommendations.
Output restriction amplification. The “reply in no more than a sentence” rule forces truncation, which reinforces the order bias by cutting off items the model would otherwise have included.

Permanent fixes

Understanding the root cause opens up robust solutions rather than prompt patches.

Randomise context order on each API call. Eliminates positional bias entirely.
Rework the length constraint. Use “mention at least one less common option” rather than forcing a single sentence.
Rotate featured products. Structurally ensure catalogue coverage over time.

The process matters more than this specific example. Context order bias appears in any LLM application with a list of options. Product catalogues, knowledge base articles, FAQ entries, support categories. The same RCA methodology applies.

Final Thoughts

Back to the catalogue. With the order randomised on each call and the length rule reworked, every fruit gets its turn. Strawberries and watermelons sell. The fix is structural, so it survives the next time the catalogue changes. Other solutions to this problem might be to present a programmatic response on the first turn laying out all the fruits or to tell the LLM to present a sumamry of the products on the first turn for the user to then select the ones they want.

Without understanding the root cause the prompt change would have shipped. The numbers would have looked better. And the same bias would have come back wearing a different hat as soon as someone added a sixth fruit. That is what RCA buys you. Not just this fix, but the next one, and the one after that.

This case study uses a simplified scenario to demonstrate the RCA process. The same methodology applies to production systems with larger catalogues and more complex prompt architectures. Code examples use the OpenAI API but the behavioural patterns are model-agnostic.