If you’ve ever used ChatGPT and felt that once the conversation takes a wrong turn it becomes unrecoverable, you’re not alone, and now there is research to back it up.

The paper LLMs GET LOST IN MULTI-TURN CONVERSATION took single-prompt instructions and broke them into shards to see how LLMs would perform when given information over multiple messages, rather than all at once. In some cases, even on flagship models like Gemini 2.5 Pro, performance dropped by 40%.

The TL;DR:

  • 39% performance drop on average when tasks unfold over multiple messages versus a single, fully specified prompt upfront
  • Unreliability more than doubles
  • Reasoning models perform just as poorly as non-reasoning models

This is really critical for both app developers and everyday AI users.

We'll dive into the experiments, why models fail in multi-turn conversations, and what you can do as a developer or user of AI apps.

What's up everyone, how's it going? Today we're going to be looking at a research paper that's really important for anyone working on any chatbot-based application, any agent, or anyone who just frequently uses something like ChatGPT. The name of the paper is "LLMs Get Lost in Multi-Turn Conversation." Basically, what it shows is the performance difference between getting a prompt that has all the information needed to solve the problem right up front, versus having to go back and forth to clarify and get info from the user before solving that problem. The headline number is a 40% drop in performance when you have to do multi-turn versus single-turn. Reasoning models fail, small models fail, unreliability doubles, and a bunch of other not-good stuff. So we'll look a little bit at the experiment, and also go over why this happens and then the solutions as well.

So basically, what the researchers did was take a prompt from a popular dataset, where all of the information is up front, and break it up into many different shards that the LLM would have to retrieve over the course of the conversation.

And you'll notice there's a bit of a process here. They didn't just cut the prompt up based on characters or anything like that. Basically, they take that prompt, use an LLM to segment it, and then run a rephrasing process. This was to show that it's not just the exact same prompt as it was initially; there's a little bit of vagueness added, and the wording is a little different. Then they do a check to make sure that while these shards are different, they still have all the important and relevant information such that the problem is solvable. And then there's a manual inspection.

There are a couple of different task types: code generation, SQL queries, math problems, and more, across 200,000 conversations, with metrics for performance, aptitude, reliability, and unreliability. We're going to focus mainly on performance. In terms of the methods tested: FULL is the normal prompt from the original benchmark. SHARDED is when the model gets those shards over the course of the conversation. CONCAT basically takes those shards, glues them all together, and sends that, so it's very similar to FULL except that the shards have been rephrased. RECAP is SHARDED plus one step at the end that adds all the shards back together, so it's a little bit of a combination of those two. And SNOWBALL prepends all the previous shards at every subsequent step as you continue to give shards to the model. All right, big table of results. So what we see here: this is FULL, this is CONCAT, this is SHARDED, and they have the other methods that we'll look at in a little bit. Then there's CONCAT over FULL and SHARDED over FULL, the relative performance for those methods. You'll see CONCAT mostly stays about the same relative to FULL. In the CONCAT-to-FULL column you're seeing a lot of 91s, 90, 93, 93, 90, so pretty good. In one case it's even 101, which means it did better overall, and there's a 103.2 for R1.

So generally, concatenating seems to give pretty similar performance. And again, remember these are the rephrased shards; this isn't just the exact original prompt. SHARDED performance, on the other hand, obviously tanks, and you can see that by all the red. This is where that 40% decrease in performance comes from. You can see it in the column here: 60%, 50, 52, 50. These are much lower relative performance numbers.

So yeah, nearly every model's accuracy falls in SHARDED versus FULL, and the average degradation is about 40%. Concatenation does well. Small models fail, big models fail. Gemini 2.5 Pro loses about 30 to 40%, which is just as much as the small models. And the reasoning models fail as well, partially related to their large outputs.

And then, as I mentioned, they also ran RECAP and SNOWBALL on a smaller set of models. As a reminder, RECAP is still a sharded conversation, but it gives all the shards again at the end. SNOWBALL prepends the shards as you collect them along the way. Neither did super well. If we look here, FULL is at 87, CONCAT is at 85, RECAP is at 66, and SNOWBALL is at 62.

And so again, those methods don't really help.

Yeah, just for a reminder here.

So why does this happen? The researchers basically noted four reasons. First, premature answer attempts: answering in the first 20% of turns averages around a 30% score, so when the model answers quickly, it's more likely to give a bad answer. If it answered in the last 20% of turns, it got about 65%. So models attempt to answer too soon and then lock in on their mistakes. Even if they're told the answer is wrong and to keep going, it's a little bit like the well has been poisoned.

Which is a bit more of what happens here. So look at this chart: it plots answer length against answer attempt for FULL, CONCAT, and SHARDED. They obviously start at the same spot for their first answer, since the length is the same, and then the further the conversation goes on, the longer their answer attempts get. LLMs will often make these wrong guesses and then cling to those mistakes, and that's what drives up the length of the answers, because the model is still holding on to all these different assumptions. Then we have lost in the middle. This is pretty well known: LLMs remember the first part and the last part of the context more, so the longer the conversation, the more there is to forget in the middle, which decreases performance. And then there's being over-verbose, which happens a lot with reasoning models especially. As response length increases, average accuracy declines. It's not by a ton, you know, we're seeing 40% versus 35%, but a 5% difference is definitely notable. So that's another issue to worry about: over-verbosity.

So what can you do? Clearly the concatenation setup seems to perform best. So what you could do is, if you have a multi-turn flow going on, once you collect all the information from the user, send it via a fresh LLM call in a single-shot prompt instead of carrying the baggage of the message history. You use an LLM to get the correct information in some capacity, and then you send it over to a fresh LLM to actually do whatever task needs to be done.

And so that's this point here. But then also, more importantly, really test these multi-turn flows. You'll see a lot of evaluation sets that aren't representative of what actually happens in production: they look like single, fully specified prompts, but the real use case of your product usually looks more like a multi-turn conversation. It would be great if all the prompts looked like the former, but that's just not realistic. So testing for this is something that's important. And yeah, that's about it. Super interesting paper, it'll be linked below. And yeah, we've got to start testing for multi-turn stuff.



Main image: a graph of LLM performance in multi-turn conversation versus single-shot prompting

Single-Turn benchmarks miss conversational complexity

Most popular benchmarks hand the model a fully specified task up front, in a single prompt. In that setup, the model sees every requirement in one go and then responds.

But real conversations and AI applications often don’t work that way. Usually users reveal information piece by piece and work with the model to clarify certain instructions. Think about ChatGPT, deep research, etc.

By not accounting for these types of use cases, benchmarks give a rosy overview of a model’s capabilities. This paper tests conversational style flows that are much more representative of common AI applications and use cases.

Experiment setup

To test conversational performance, the researchers took single-turn prompts from popular benchmarks and sliced each into a series of smaller “shards”. Each shard reveals just one piece of the full prompt.

Fully specified prompt versus sharded instructions

These shards were generated via a four-step process, not just by cutting up the original full prompt:

  • Segmentation: An LLM splits the full prompt into non-overlapping segments via a few-shot prompt.
  • Rephrasing: Each segment is rephrased and reordered into conversational “shards”.
  • Verification: The original and sharded instructions are tested side by side to confirm no information was lost.
  • Manual Review: The authors do a final manual review.

Graphic showing the steps of the sharding process
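
To make the pipeline a bit more concrete, here is a minimal sketch of how the first two steps (segmentation and rephrasing) could be automated. The `call_llm` helper and the prompt wording are illustrative placeholders, not the paper's actual few-shot prompts.

```python
# Hypothetical sketch of shard generation (segmentation + rephrasing).
# `call_llm` is a stand-in for whatever chat-completion client you use;
# the prompts are illustrative, not the paper's actual few-shot prompts.
import json


def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion client here")


def generate_shards(full_prompt: str) -> list[str]:
    # Step 1: split the fully specified prompt into non-overlapping segments,
    # one piece of information per segment.
    segments = json.loads(call_llm(
        "Split the following instruction into non-overlapping segments, one "
        "piece of information each. Return a JSON list of strings.\n\n" + full_prompt
    ))
    # Step 2: rephrase each segment into a shorter, conversational "shard".
    return [
        call_llm(
            "Rephrase this instruction fragment as a short, casual chat message, "
            "keeping all of its information intact:\n\n" + segment
        )
        for segment in segments
    ]
```

The verification and manual-review steps then confirm that the shards, taken together, still contain everything needed to solve the original task.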

At each turn, the model sees only the next shard, responds, and may attempt an answer or ask for clarification.
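
Here is a simplified sketch of that simulation loop, assuming a hypothetical `chat` function that takes the running message history and returns the assistant's reply, plus an `is_answer_attempt` classifier (the paper uses an LLM to categorize replies; these names are placeholders):

```python
# Simplified sketch of the sharded simulation: reveal one shard per user turn,
# let the model either ask for clarification or attempt an answer, and keep
# the last answer attempt for evaluation.
def run_sharded_conversation(shards, chat, is_answer_attempt):
    history = []
    last_attempt = None
    for shard in shards:
        history.append({"role": "user", "content": shard})
        reply = chat(history)                # model only sees shards revealed so far
        history.append({"role": "assistant", "content": reply})
        if is_answer_attempt(reply):         # clarification questions are not scored
            last_attempt = reply             # later attempts overwrite earlier ones
    return last_attempt                      # this is what gets evaluated
```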

Key components of the setup:

  • Tasks & models: Six task types (code generation, SQL queries, math problems, API-style Actions, data-to-text, and long-document summarization) across 15+ LLMs.
  • Scale: Over 200,000 synthetic conversations.
  • Metrics:
    • Performance (P): Overall average accuracy
    • Aptitude: The 90th-percentile score
    • Unreliability (U₉₀₋₁₀): The difference between the 90th- and 10th-percentile scores, capturing the range of best to worst.
    • Reliability (R): Represents the model’s consistency by quantifying how tightly clustered its performance is around the average. Calculated as 100 – U₉₀₋₁₀.
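
As a rough sketch, here is how those four metrics could be computed from a list of per-simulation scores on a 0–100 scale (the paper runs many simulations per instruction and aggregates; this just shows the formulas):

```python
import numpy as np


def conversation_metrics(scores: list[float]) -> dict[str, float]:
    """Compute Performance, Aptitude, Unreliability, and Reliability from 0-100 scores."""
    s = np.asarray(scores, dtype=float)
    p90, p10 = np.percentile(s, [90, 10])
    unreliability = p90 - p10                # U_90-10: gap between best and worst runs
    return {
        "performance": s.mean(),             # P: average score
        "aptitude": p90,                     # A: 90th-percentile (best-case) score
        "unreliability": unreliability,
        "reliability": 100 - unreliability,  # R = 100 - U_90-10
    }
```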

Methods tested

Showing how the different testing methods compare

Evaluation Settings

  • FULL – The model sees the original, fully specified instruction in one prompt (single-turn).
  • SHARDED – The instruction is split into N “shards,” revealed one per turn.
  • CONCAT – All shards are concatenated into a single prompt. Differs from FULL in that the shards are less specific.
  • RECAP – Same as SHARDED, but on the final turn all previous shards are sent. In this case, there is a message history (unlike in CONCAT).
  • SNOWBALL – Like SHARDED, but at each turn all prior shards are prepended before revealing the next one.
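
To make the differences concrete, here is a small sketch of the sequence of user turns the model would see under each setting, derived from the original prompt and its shards (joining shards with newlines is an assumption for illustration):

```python
# Sketch: the sequence of user messages the model sees under each setting.
def user_turns(full_prompt: str, shards: list[str]) -> dict[str, list[str]]:
    return {
        "FULL": [full_prompt],                        # one turn, original fully specified prompt
        "CONCAT": ["\n".join(shards)],                # one turn, rephrased shards glued together
        "SHARDED": list(shards),                      # one shard revealed per turn
        "RECAP": list(shards) + ["\n".join(shards)],  # sharded, plus a final turn repeating all shards
        "SNOWBALL": ["\n".join(shards[: i + 1])       # each turn repeats all prior shards
                     for i in range(len(shards))],
    }
```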

Results

Let’s look at some data:

Table of results from the main experiment
Average performance of a variety of LLMs across the six tasks. The final two columns show CONCAT and SHARDED performance as a percentage of FULL, averaged across all six tasks.
  • Universal drop: Nearly every model’s accuracy falls in SHARDED vs. FULL, with an average degradation of 39%.
  • CONCAT to the rescue: CONCAT performance averages 95.1% of the FULL baseline, which shows that information loss from sharding isn’t the reason why performance drops.
  • Smaller models perform slightly worse: Llama3.1-8B-Instruct, OLMo-2-13B, and Claude 3 Haiku show larger CONCAT hits (86–92% of FULL).
  • Flagship models fail too: Claude 3.7 Sonnet, Gemini 2.5 Pro, and GPT-4.1 lose 30–40% in SHARDED mode, just as much as smaller models.
  • Reasoning tokens don’t save the day: o3 and Deepseek-R1 degrade just like their non-reasoning counterparts. Their roughly 33% longer responses leave more room for incorrect assumptions.

Next, the researchers tested the RECAP and SNOWBALL methods on GPT-4o and GPT-4o-mini to see how much refreshing the model's memory could recover the lost performance.

As a reminder, here is how each method works:

RECAP: After the usual multi-turn shards, add one final user turn that repeats all previous shards before the model’s last attempt.

SNOWBALL: At each turn, prepend all prior shards before revealing the next one, giving the model continuous redundancy.

A small table comparing 4o-mini and 4o on RECAP and SNOWBALL
Order: FULL, CONCAT, SHARDED, RECAP, SNOWBALL

  • RECAP recovers to ~66–77%, up from ~50% on SHARDED
  • SNOWBALL adds ~12–15 percentage points over SHARDED, but underperforms compared to RECAP and trails FULL by ~15–20 points.

Why models fail at multi-turn conversations

In general, the researchers determined four major failure reasons.

1. Premature answer attempts

Across every model, accuracy increases when the first answer attempt happens later in the conversation.

Table showing model performance at different points when they attempt first answer

  • First 20% of turns: 30.9% average score
  • Last 20% of turns: 64.4% average score

Models that attempt to answer too soon tend to lock in mistakes. The longer they wait, and the more information they gather, the better their chance of putting it all together and answering correctly.

2. Verbosity inflation (answer bloat)

Throughout the multi-turn conversation, the LLM will generate incorrect answer attempts and related assumptions. As the user reveals more information, the model doesn’t always invalidate previous incorrect assumptions.

This leads to longer final solutions, aka, “answer bloat”.

Four graphs showing the number of tokens used in output generation versus method type

  • Final answers grow well beyond single-turn baselines (Code climbs from ~700 chars to over 1,400).
  • Assumptions stick: New shards rarely invalidate prior guesses, so each response layers on more content.
  • Result: Bloated, error-ridden outputs.

3. Lost in the middle

It is well-known that models tend to pay most attention to what they see first and last in a given context window, often skipping over content in the middle.

Below is an analysis of how often models cite each document in their running summaries during sharded simulations. At every turn, the LLM produces an updated summary (y-axis) that may include citations from any documents revealed so far.

Graph showing how LLMs are more likely to cite documents in the beginning or end of their context window

4. Over-verbosity harms performance

As response length increases, average accuracy declines (not by a ton, but not a little):

Table showing how generation length compares to performance

Longer model outputs risk veering off course by introducing extra assumptions.

Advice for agent and app builders

If you’re building LLM-based applications with multi-turn conversations, these strategies can help mitigate performance degradation:

  • Test multi-turn flows: Explicitly include multi-turn scenarios in your test cases and evaluation suites.
  • Consolidate before generation: When you’re ready to generate an output, batch all collected user context into one prompt and send it as a fresh LLM call instead of continuing to drip information.
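
Here is a minimal sketch of that consolidation pattern. The `ask_user` and `llm_single_shot` helpers are hypothetical placeholders for your own UI and model client, not a specific SDK:

```python
# Sketch: use the conversation only to COLLECT information, then answer with
# one fresh, fully specified single-shot call (no multi-turn history attached).
def gather_requirements(questions, ask_user):
    return [ask_user(q) for q in questions]          # collect answers turn by turn


def answer_with_fresh_call(task_description, facts, llm_single_shot):
    prompt = (
        task_description
        + "\n\nEverything gathered from the user:\n"
        + "\n".join(f"- {fact}" for fact in facts)
    )
    return llm_single_shot(prompt)                   # fresh call, no message history
```

This mirrors the CONCAT setting from the paper, which recovered most of the single-turn performance.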

Implications for everyday AI users

  • If your chat goes off the rails, start fresh. Rather than wrestling with a derailed conversation, start a new chat and try to give as much context as possible up front.
  • Consolidate mid-chat. Ask the model “Can you summarize everything I’ve told you so far?” then paste that summary into a fresh session to reset context.
  • Keep prompts focused and concise. Short, pointed messages can help the model stay on track. Try to avoid rambling instructions spread over multiple back-and-forths, which create more chances for incorrect assumptions.

Conclusion

This was an eye-opener. So many of the AI applications out there today are multi-turn, but I’m very confident in saying that multi-turn conversations are not thought of enough when doing testing. Even top LLMs can get lost after just two back-and-forths (!!), resulting in accuracy declines of 39%. Whether you’re testing in PromptHub or some other tool, don’t sleep on having multi-turn test cases as a core part of your testing and eval workflow.

Headshot of Prompthub co-founder Dan Cleary
Dan Cleary
Founder