
Lost in the Middle — Prompt Design that Beats LLM Position Bias


TL;DR

- LLMs read long prompts with a U-shaped bias: information at the beginning and the end gets used reliably, while information buried in the middle tends to be ignored ("Lost in the Middle").
- In practice the effect shows up once system prompts reach several hundred lines, and it persists, in milder form, in current models.
- The most practical countermeasure is the tail checklist pattern: restate the critical instructions as a short checklist at the very end of the prompt.
- Other options include the sandwich strategy, XML-tag structuring, and reordering retrieved documents in RAG pipelines.

1. What is Lost in the Middle?

LLM position bias

If you’ve built an LLM application, you’ve probably run into the complaint that “the instructions in the prompt got ignored.” It tends to happen once the system prompt grows to hundreds of lines.

This is the phenomenon known as Lost in the Middle. It was systematically reported in Liu et al. (2023), “Lost in the Middle: How Language Models Use Long Contexts”.

The core finding: LLMs exhibit a U-shaped performance curve.

Performance

 │  ★                                    ★
 │   ★                                 ★
 │    ★★                            ★★
 │      ★★★                     ★★★
 │         ★★★★★★★★★★★★★

 └──────────────────────────────────────► Information position
   Beginning      Middle (degraded)        End

Concretely, on tasks that ask the model to answer questions referencing multiple documents, performance for information placed in the middle drops by more than 30% compared to the beginning or the end (Liu et al., 2023; the magnitude depends on model and task).

Why middle information gets lost — RoPE’s long-range decay

A leading cause is the long-range decay effect of RoPE (Rotary Position Embedding), which most modern LLMs use. Note that position bias isn’t only about RoPE — the structure of causal attention masks (the triangular masks that prevent each token from attending to tokens after it) and biases in the positional distribution of training data are all considered contributing factors.

RoPE adjusts attention strength based on the relative position between tokens. A standard Transformer remembers absolute positions (“which slot in the input is this token in?”), while RoPE encodes the relative distance between two tokens as a rotation angle of the vector. The rotation angle grows with distance, which naturally attenuates attention scores between far-apart tokens.
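
As a rough illustration of that decay (a toy sketch, not the attention computation of any particular model), the snippet below applies a RoPE-style rotation to an identical query/key pair and prints how the raw attention logit shrinks, on average, as the distance between the two positions grows:

import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each 2-D pair of the vector by an angle proportional to the position."""
    half = x.shape[0] // 2
    freqs = base ** (-np.arange(half) / half)  # one rotation frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = q.copy()  # identical content, so only the positional distance changes the score

for distance in (0, 8, 64, 512, 4096):
    logit = rope_rotate(q, pos=distance) @ rope_rotate(k, pos=0)
    print(f"distance {distance:>4}: attention logit {logit:7.2f}")

The logit is highest at distance 0 and oscillates downward as the distance grows, which is the long-range decay described above.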

Instructions at the very start of a prompt are “far” from the most recent generated tokens. But because of how a causal language model works — generating tokens left to right with each token only able to attend to earlier tokens — the leading tokens are referenced repeatedly while processing every subsequent token. This cumulative effect ends up preserving information at the beginning and the end strongly.

Tokens in the middle don’t get the same cumulative leverage as the beginning, and they aren’t close to the generation point like the end either. They fall into an “attention valley.”


2. Concrete examples from real projects

How long does a prompt have to be before this matters?

With recent models, you almost never see this on prompts a few dozen lines long. In my experience, the impact starts to show up at system prompts of several hundred lines — for example, RAG setups injecting large amounts of context, or agent applications with complex rule sets.

The problem persists in the latest models. Modarressi et al. (2025), “NoLiMa: Long-Context Evaluation Beyond Literal Matching” (ICML 2025), found that 11 of 13 models — including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet — fell below 50% of their short-prompt baseline performance at a 32K-token context. Subsequent evaluations showed the same trend on GPT-4.1 and Gemini 2.5 Flash.

Chroma Research’s “Context Rot: How Increasing Input Tokens Impacts LLM Performance” (July 2025) tested 18 models for long-context degradation. Interestingly, the failure mode differs by model family: GPT-family models tend to confidently return wrong answers (hallucinate), while Claude-family models tend to abstain when uncertain. The same study reports that the U-shape Liu et al. observed wasn’t consistently reproduced — so position bias may manifest differently depending on task and model.

I’ve personally observed similar behavior on GPT-4.1 mini. Across model generations, position bias is easing but not eliminated.

The example below is simplified for clarity. In real settings, you’d have dozens of similar sections stacking up to several hundred lines or thousands of tokens — that’s when the problem appears.

Middle rules ignored in a system prompt

Imagine a system prompt with the following section structure spanning several hundred lines:

You are a customer support assistant.            ← Near the top: followed

## Basic rules
- Respond politely
- Address the user by name

... (dozens more sections) ...

## Response format                                ← Buried in the middle
- Keep answers within 3 sentences
- Use bullet points

## Data reference guide                           ← Buried in the middle
- Always look up pricing in the database

... (many more sections) ...

## Prohibited                                     ← Near the bottom: followed
- Don't recommend competitor products
- Don't ask for personal information

The “Basic rules” at the top and the “Prohibited” list at the bottom get followed, but the “Response format” and “Data reference guide” buried in the middle get ignored. The longer the prompt, the more often you hit this pattern.

Missing requirements in code generation

The same pattern shows up in code generation when the requirements section is long. The tech stack at the top and the response format at the bottom get followed, while validation and error-handling requirements written in the middle drop out entirely. If the whole prompt is short, no problem — but as context and examples grow, the impact starts showing.


3. The tail checklist pattern

Overview

The most practical countermeasure for Lost in the Middle is the tail checklist pattern. You restate the important instructions as a checklist at the very end of the prompt, prompting the LLM to “double-check.”

Before / After

The example below simplifies a system prompt that would normally span several hundred lines. In practice each section is more detailed, with many rules and chunks of context in between.

Before (middle instructions get buried):

You are a code review assistant.

## Review perspectives
... (5 items)

... (many sections: coding standards, language-specific rules, edge cases...)

## Output format                       ← Buried in the middle
- Classify severity as High/Medium/Low
- Attach a code example for each suggestion
- State the impact scope

... (more sections)

## Code under review
{code}

After (checklist appended at the end):

You are a code review assistant.

## Review perspectives
... (5 items)

... (many sections: coding standards, language-specific rules, edge cases...)

## Output format
- Classify severity as High/Medium/Low
- Attach a code example for each suggestion
- State the impact scope

... (more sections)

## Code under review
{code}

---
## Final checklist before output       ← Added here
Before producing the answer, confirm all of the following:
- [ ] Did you address all 5 review perspectives?
- [ ] Did you assign a severity (High/Medium/Low) to each finding?
- [ ] Did you attach a code example to each finding?
- [ ] Did you state the impact scope?

By placing the checklist at the end, the LLM “re-recognizes” these constraints right before generating output. You’re flipping the U-shape to your advantage by putting the verification items in the position where attention is highest — the end.

In my experience, after introducing this pattern the rate of middle-buried instructions getting ignored dropped noticeably. It works especially well for instructions like “output format,” which tend to live in the middle of prompts.
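
If you assemble prompts in code, a small helper can guarantee the checklist always lands at the very end. This is a hypothetical sketch (the helper name and the base_prompt variable are not from any library):

def with_tail_checklist(system_prompt: str, checks: list[str]) -> str:
    """Restate the given checks as a checklist at the very end of the prompt."""
    lines = [
        "---",
        "## Final checklist before output",
        "Before producing the answer, confirm all of the following:",
    ]
    lines += [f"- [ ] {check}" for check in checks]
    return system_prompt.rstrip() + "\n\n" + "\n".join(lines) + "\n"

# base_prompt is the long system prompt from the Before example above
prompt_text = with_tail_checklist(base_prompt, [
    "Did you address all 5 review perspectives?",
    "Did you assign a severity (High/Medium/Low) to each finding?",
    "Did you attach a code example to each finding?",
    "Did you state the impact scope?",
])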

Real example: improving structured JSON output

I ran into this concretely while using LangChain with OpenAI models for a task that extracted structured JSON from free-form user text. The setup used LangChain’s with_structured_output, so the schema and field descriptions were defined via Pydantic’s Field(description=...), while extraction rules for each field (required vs. optional, default values, format specifications, etc.) lived in the prompt.

As the number of fields grew, extraction accuracy for fields described in the middle of the prompt visibly dropped. Field rules near the top and bottom were applied fine, but fields buried in the middle came back as null or with wrong values — exactly the U-shape.

Adding a reminder at the end of the prompt (after the user input) measurably improved extraction accuracy for those fields.

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

# Pydantic schema (description is also passed to the LLM)
class TaskSchema(BaseModel):
    category: str = Field(description="Choose from the predefined categories")
    priority: int = Field(description="Integer 1-5")
    due_date: str | None = Field(description="ISO 8601 date, null if unknown")

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an assistant that extracts task information.
Extract structured data from the user's input."""),
    ("human", """{user_input}

## Pre-output check
Before responding, confirm the following fields are extracted correctly:
- "category": must be picked from the predefined categories (do not guess)
- "priority": must be an integer 1-5
- "due_date": must be ISO 8601 (null if no date is in the input)"""),
])

llm = ChatOpenAI(model="gpt-4.1-mini")  # any chat model with structured-output support works
structured_llm = llm.with_structured_output(TaskSchema)

user_input = "Renew the TLS certificate, fairly urgent, ideally by 2025-07-01"  # example input
messages = prompt.format_messages(user_input=user_input)
result = structured_llm.invoke(messages)  # a TaskSchema instance

What’s notable is that adding this tail reminder had almost no effect on the other fields (the ones that were already being extracted correctly). When I tried the same instructions as emphasis on the field definition in the middle of the prompt, surrounding fields would sometimes drift slightly, but the tail placement showed virtually no such side effects. The tail checklist lets you reinforce a weak spot in a targeted way without breaking the existing output.

One caveat: when you tweak the extraction rules in the prompt, you have to update the Pydantic model’s Field(description=...) to match — otherwise the prompt and the schema disagree, and accuracy can suffer despite your fix. with_structured_output passes the schema’s description to the LLM as well, so prompt and schema need to stay in sync. It’s a mundane point but easy to overlook in practice.

On injecting domain-specific knowledge in LangChain, the LangChain blog post “Incorporating domain specific knowledge in SQL-LLM solutions” recommends dynamically retrieving relevant few-shot examples rather than relying on a static prompt:

A more powerful approach is to have a robust dataset of good examples, and dynamically include those which are relevant to the user question.

Specifically, the post shows building a custom Retriever Tool backed by a vector database to fetch examples semantically similar to the user’s question. For structured-output tasks with many fields, dynamically selecting and placing the rules relevant to the input — rather than statically listing every rule — may be less susceptible to Lost in the Middle.
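
As a minimal sketch of that idea, reusing user_input from the earlier snippet (the example texts, the vector store choice, and k are assumptions for illustration, not taken from the post):

from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# A small pool of curated extraction examples (made up for illustration)
examples = [
    "Input: 'Ship the hotfix by Friday' -> category=release, priority=5, due_date set",
    "Input: 'Clean up stale branches sometime' -> category=maintenance, priority=2, due_date=null",
    "Input: 'Draft the Q3 budget, due end of September' -> category=planning, priority=4, due_date set",
]

store = InMemoryVectorStore.from_texts(examples, OpenAIEmbeddings())

# At request time, pull only the examples most similar to the user's input
relevant = store.similarity_search(user_input, k=2)
few_shot_block = "\n".join(doc.page_content for doc in relevant)
# few_shot_block is then injected into the prompt instead of a static list of every rule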

When the tail checklist isn’t enough

The technique has limits. A tail checklist can only hold a handful of items (see the tips below), so it can’t rescue a prompt that is overloaded with rules, and it doesn’t help when the problem is a large block of injected context rather than forgotten instructions. In those cases the better lever is restructuring the prompt itself or fixing placement outside the prompt, which is what the approaches in section 4 cover.

Implementation tips

Tips for using the tail checklist effectively:

1. Write the checklist as "verification items," not as a copy of the body
   - Bad:  Pasting the same prose
   - Good: A concise list of points to verify

2. Cap the list at 5–10 items
   - Too many backfires (Lost in the Middle inside the checklist itself)

3. Explicitly say "verify before answering"
   - Encourage a verification pass

4. Prioritize the items most often missed
   - Don't restate everything; emphasize what's empirically dropped

4. Other approaches

The tail checklist is lightweight and effective, but there are also approaches that improve the structure of the prompt itself, or address the issue outside the prompt — like the RAG pipeline.

Sandwich strategy

Place the most important information at both the beginning and the end of the prompt.

## Most important rule
Always return output in JSON format.

## Context
{lots of context...}

## Additional info
{more context...}

## Reminder: always return output in JSON format.

You’re putting the critical instruction at the two ends of the U-shape — the highest-performing positions — so it’s simple but effective. The trade-off is that you have to pick a single “most important” item, which makes it a poor fit when you want to emphasize multiple instructions at once.

XML-tag structuring and section splitting

Use XML tags or Markdown headers to clearly partition the prompt into sections that are easy for the LLM to parse.

Anthropic’s prompt engineering tutorial recommends separating data and instructions with XML tags. By bracketing input data with tags like <sentences>...</sentences>, the LLM can more clearly distinguish the data region from the instruction region, which can reduce the risk of missing middle information. Note that XML structuring doesn’t eliminate position bias by itself — it’s better used in combination with other techniques.

<system>
You are a data analysis assistant.
</system>

<rules>
<rule priority="high">Always cite the source of any number</rule>
<rule priority="high">Mark estimates explicitly as "estimated"</rule>
<rule priority="medium">Include axis labels in chart descriptions</rule>
</rules>

<context>
{the data to analyze}
</context>

<output_format>
{output format specification}
</output_format>

Adding a priority attribute also gives the LLM a hint for judging importance. Making it explicit “what is written where” through structure helps reduce the risk of middle information being buried.

Strategic document placement in RAG

In a RAG (Retrieval-Augmented Generation) pipeline, the ordering of retrieved documents directly affects answer quality.

def reorder_documents(docs: list[str], scores: list[float]) -> list[str]:
    """
    A Lost in the Middle countermeasure: place the highest-relevance
    documents at the beginning and the end.

    Example: scores [A(0.9), B(0.8), C(0.7), D(0.6), E(0.5)]
    Result:  [A(0.9), C(0.7), E(0.5), D(0.6), B(0.8)]
              ^^^^^^                           ^^^^^^
              High score at head        High score at tail
    """
    scored_docs = list(zip(docs, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    head = []  # head side (even indices: 1st, 3rd, 5th...)
    tail = []  # tail side (odd indices: 2nd, 4th, 6th...)

    for i, (doc, score) in enumerate(scored_docs):
        if i % 2 == 0:
            head.append(doc)
        else:
            tail.append(doc)

    # Reverse the tail so the highest-scoring item lands at the very end
    return head + tail[::-1]

By keeping the lowest-relevance documents in the middle and the highest-relevance ones at the ends, you reduce the risk of important information being overlooked.
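
For example, with five retrieved chunks and hypothetical relevance scores:

docs = ["pricing table", "refund policy", "shipping FAQ", "brand history", "office hours"]
scores = [0.92, 0.85, 0.71, 0.60, 0.48]

print(reorder_documents(docs, scores))
# ['pricing table', 'shipping FAQ', 'office hours', 'brand history', 'refund policy']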

Quantitative validation of position vs. accuracy

Lost in the Middle also affects few-shot prompting. Anthropic’s blog post “Prompt engineering for Claude’s long context window” quantitatively evaluates techniques for improving information retrieval from long contexts — like extracting relevant quotes first before answering, and adding correctly answered Q&A examples to the prompt.

If you want to measure how much position bias affects your own prompts, building a validation pipeline informed by these benchmarks is a good starting point.
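
As a rough starting point, the sketch below moves a known needle sentence to different positions in a filler context and counts how often the model recovers it. It assumes the LangChain llm instance from earlier; the needle, filler text, and positions are invented for illustration:

NEEDLE = "The internal project codename is BLUEFERN."
QUESTION = "What is the internal project codename? Reply with the codename only."
FILLER = "This paragraph is unrelated background about office logistics and seating plans. " * 20

def build_context(position: float, n_chunks: int = 30) -> str:
    """Place the needle at a relative position in the context (0.0 = start, 1.0 = end)."""
    chunks = [FILLER] * n_chunks
    chunks.insert(round(position * n_chunks), NEEDLE)
    return "\n\n".join(chunks)

for pos in (0.0, 0.25, 0.5, 0.75, 1.0):
    hits = 0
    for _ in range(5):  # a few repetitions per position to smooth out noise
        answer = llm.invoke(f"{build_context(pos)}\n\n{QUESTION}").content
        hits += "BLUEFERN" in answer
    print(f"needle at {pos:.0%} of the context: {hits}/5 correct")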


5. Summary

Comparing the techniques

| Technique | Use case | Token cost | Implementation difficulty |
|---|---|---|---|
| Tail checklist | System prompts in general | Low (just the list) | Low |
| Sandwich strategy | Single most-important rule | Low (one restated line) | Low |
| XML-tag structuring | Multiple kinds of information | Medium (tag overhead) | Medium |
| RAG document placement | RAG pipelines | None (reorder only) | Medium |

- Pay attention to information placement
- Use a tail checklist for double-verification
- Continuously monitor prompt quality



References

- Liu et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.”
- Modarressi et al. (2025). “NoLiMa: Long-Context Evaluation Beyond Literal Matching.” ICML 2025.
- Chroma Research (2025). “Context Rot: How Increasing Input Tokens Impacts LLM Performance.”
- LangChain Blog. “Incorporating domain specific knowledge in SQL-LLM solutions.”
- Anthropic. “Prompt engineering for Claude’s long context window.”
