Context Order Bias: Why This LLM Only Sold Apples and Oranges

A simple fruit-selling chatbot consistently ignored two of its five products. A knee-jerk prompt fix masked the symptom; methodical root cause analysis uncovered two interacting biases and led to a permanent fix.

Nov 2025 · 12 min read

The problem: two fruits never get sold

A conversational AI was built to recommend fruit to customers. Five products in the catalogue. Simple system prompt with product descriptions and a set of behavioural rules. The kind of setup that looks right on first inspection.

We ran it 10 times with the same user input, “hello, I’d like some fruit recommendations”, and counted which fruits appeared in the responses.

Strawberries and watermelons were never recommended. Not once across 10 conversations.

The setup

The application used a standard chat completion pattern: a system prompt containing product descriptions and behavioural rules, plus a user message:

system_prompt_input = '''
You are a helpful assistant that always provides concise
and relevant answers to user queries. You are to try to
sell the user fruit.

Below is your context:
Apple
Description: A crisp and juicy fruit with a rounded shape...

Banana
Description: This is a long, curved fruit with a soft,
creamy white flesh...

Orange
Description: A round citrus fruit known for its tough,
bright orange skin...

Strawberry
Description: A small, heart-shaped fruit that is bright
red when ripe...

Watermelon
Description: A very large, heavy fruit with a thick,
striped green rind...

Rules:
1. Always stay on topic.
2. Avoid unnecessary information.
3. Provide clear and direct responses only.
4. Reply in no more than a sentence.
'''
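Each run sent this prompt through a standard chat-completion call. A minimal sketch of the payload construction (the model call itself is omitted; the OpenAI-style role structure is the only assumption):

```python
def build_messages(system_prompt: str, user_input: str) -> list[dict]:
    """Assemble a chat-completion payload: system prompt first, then the user turn."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]

# The full system_prompt_input from above would be passed in place of this stub.
messages = build_messages(
    "You are a helpful assistant... (system_prompt_input as above)",
    "hello, I'd like some fruit recommendations",
)
```

The same payload was sent for every iteration, so any variation in which fruits appear comes from the model, not the input.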

Typical output looked like this:

Try a crisp apple for a versatile snack, a sweet banana
for energy on the go, or a juicy orange for a refreshing
vitamin C boost!

Every response followed the same pattern: apples first, sometimes bananas, always oranges. Never strawberries. Never watermelons.

The knee-jerk fix (and why it didn’t work)

Without root cause analysis, the obvious fix is to add an instruction:

“All fruits should be represented equally in sales, including Apples, Oranges, Bananas, Strawberries and Watermelon.”

After this change:

Apple: 10, Banana: 3, Orange: 10, Strawberry: 3, Watermelon: 4

Better, but still heavily skewed. Apples and oranges still dominate. The prompt patch masked the symptom without addressing the cause.

At this point, it would be tempting to ship the fix and move on. The numbers improved. But this hasn’t uncovered why the bias exists, which means it’ll resurface in different forms as the product catalogue changes.

Applying root cause analysis

The process follows six steps: observability (capture inputs and outputs), categorisation (label the failures), pattern identification (find clusters), investigation (analyse structural causes), hypothesis testing (isolate variables systematically), and reporting (document findings for future reference).

Observability

We captured every input and output, then decomposed the prompt into its structural parts: the role instruction, the product catalogue, and the behavioural rules.

Categorisation

The failure counts gave us clear labels: Apple (10), Orange (10), Banana (7), Strawberry (0), Watermelon (0). A clean split between “always mentioned” and “never mentioned.”

Pattern identification

The cluster was obvious: the LLM heavily favoured fruits listed earlier in the context. The ordering in the prompt, Apple, Banana, Orange, Strawberry, Watermelon, correlated almost exactly with mention frequency.

Investigation

Two structural elements in the prompt could explain this:

  1. Context ordering: fruits listed earlier get more attention
  2. Output length restriction: “reply in no more than a sentence” forces the model to pick a subset, and it picks from the top

Hypothesis testing

We built a counting mechanism and increased test iterations to 50 for a more statistically meaningful sample:

fruit_count = {
    "apple": 0, "banana": 0, "orange": 0,
    "strawberry": 0, "watermelon": 0,
}

# Word roots so that plurals ("apples", "strawberries") still match.
fruit_search_roots = {
    "apple": "apple", "banana": "banana", "orange": "orange",
    "strawberry": "strawberr", "watermelon": "watermelon",
}

# list_of_responses holds the raw model outputs from the 50 runs.
for response in list_of_responses:
    response_lower = response.lower()
    for fruit_name, root in fruit_search_roots.items():
        if root in response_lower:
            fruit_count[fruit_name] += 1
Then tested each hypothesis in isolation.

Hypothesis 1: Context order bias

Test: Reverse the context order to Watermelon → Strawberry → Orange → Banana → Apple.

50 iterations, reversed context order

Fruit         Original order   Reversed order
Apple         50 (always)      24
Banana        3                50
Orange        50 (always)      13
Strawberry    0 (never)        49
Watermelon    0 (never)        48

Reversing the order flipped the bias almost entirely. Strawberry and watermelon, previously invisible, now dominated. But 24 apples persisted despite being listed last, hinting at a residual bias worth investigating further.
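Running this test only requires changing the order in which the catalogue is rendered into the system prompt. A sketch (descriptions abbreviated exactly as in the original prompt):

```python
DESCRIPTIONS = {
    "Apple": "A crisp and juicy fruit with a rounded shape...",
    "Banana": "This is a long, curved fruit with a soft, creamy white flesh...",
    "Orange": "A round citrus fruit known for its tough, bright orange skin...",
    "Strawberry": "A small, heart-shaped fruit that is bright red when ripe...",
    "Watermelon": "A very large, heavy fruit with a thick, striped green rind...",
}

def build_context(order: list[str]) -> str:
    """Render the product section of the system prompt in the given order."""
    return "\n\n".join(f"{name}\nDescription: {DESCRIPTIONS[name]}" for name in order)

original_order = list(DESCRIPTIONS)          # Apple ... Watermelon
reversed_context = build_context(original_order[::-1])
```

Keeping everything else identical isolates ordering as the only variable under test.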

Hypothesis 2: Output length restriction

Test: Remove the rule “reply in no more than a sentence.” Keep original context order.

Apple: 50, Banana: 50, Orange: 49, Strawberry: 46, Watermelon: 49

With the length restriction removed, all five fruits were recommended roughly equally. The model wanted to mention all products; the sentence limit forced it to truncate, and truncation favoured items listed first.

This works, but at the cost of longer responses and more tokens. Any fix using this approach needs to weigh verbosity against the business requirement for concise recommendations.
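Making the length rule a switch keeps the two prompt variants in sync while testing. A minimal sketch (the helper name is an illustration, not from the original code):

```python
def build_rules(limit_to_one_sentence: bool = True) -> str:
    """Render the Rules block, optionally dropping the length restriction."""
    rules = [
        "Always stay on topic.",
        "Avoid unnecessary information.",
        "Provide clear and direct responses only.",
    ]
    if limit_to_one_sentence:
        rules.append("Reply in no more than a sentence.")
    return "Rules:\n" + "\n".join(f"{i}. {r}" for i, r in enumerate(rules, 1))

with_cap = build_rules()
without_cap = build_rules(limit_to_one_sentence=False)
```

Generating both variants from one source avoids the prompts silently drifting apart between test runs.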

Hypotheses 3 & 4: Isolating apple bias vs. positional dominance

The residual 24 apples from Hypothesis 1 raised a question: does the model have an inherent preference for apples (perhaps from training data), or is position always the dominant factor?

Hypothesis 3: Ask the model to pick a random fruit from the list Apple, Banana, Orange, Strawberry, Watermelon.

Apple: 0, Banana: 5, Orange: 12, Strawberry: 20, Watermelon: 13

Zero apples, despite being listed first. When explicitly asked to be random, the model avoided the first item. Interesting, and seemingly at odds with the earlier results.

Hypothesis 4: Same test, reversed list, Watermelon, Strawberry, Orange, Banana, Apple.

Apple: 31, Banana: 3, Orange: 15, Strawberry: 1, Watermelon: 0

The last-listed fruit (now Apple) dominated. This confirms that positional bias is the primary driver; the specific fruit doesn’t matter. In the “random pick” framing, the model favoured items at the end of the list rather than the beginning, a different but related positional effect.

Root cause report

The knee-jerk fix was a fragile patch that only masked the symptom. Methodical RCA uncovered the real problem, enabling a permanent, robust fix.

Observed behaviour

The LLM consistently recommended apples, oranges, and bananas while ignoring strawberries and watermelons entirely.

Root causes identified

Two interacting biases:

  1. Context order bias: the model strongly favours items listed earlier in the context when generating recommendations
  2. Output restriction amplification: the “reply in no more than a sentence” rule forces truncation, which reinforces the order bias by cutting off items the model would otherwise have included

Permanent fixes

Understanding the root cause opens up robust solutions rather than prompt patches:

  1. Shuffle the product order per request, so no item benefits from a fixed position in the context.
  2. Relax the one-sentence limit, or explicitly require coverage of every product, so truncation no longer amplifies the order bias.

The process matters more than this specific example. Context order bias appears in any LLM application that presents a list of options: product catalogues, knowledge base articles, FAQ entries, support categories. The same RCA methodology applies.


This case study uses a simplified scenario to demonstrate the RCA process. The same methodology applies to production systems with larger catalogues and more complex prompt architectures. Code examples use the OpenAI API but the behavioural patterns are model-agnostic.