When ‘Thinking’ Models Stop Thinking

Apple may have just made their biggest splash in AI and it has nothing to do with any software. Their recent paper, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity” has been making waves because it delivers a blunt critique of today’s reasoning models.

The authors argue that the reason these models are so great on benchmarks comes from over-exposure to those benchmarks, not from the “reasoning”.

The TL;DR of their findings:

Non-reasoning models beat reasoning models on low-complexity tasks
Reasoning models beat non-reasoning models on medium-complexity tasks
Performance of both model types collapse at the same level of complexity

But there is so much more to it! Let’s jump in.

‍

apple may have just made its most significant contribution to AI in the past 12 months and has nothing to do with software or Siri It's actually a paper that they published in the past couple weeks called the illusion of thinking understanding the strengths and limitations of reasoning models via the lens of problem complicity

And so I dove deep into this paper It was making the round so you might have seen it already Um but I think a lot of people missed out on some nuance throughout But overall there is a lot of really interesting information here studying essentially reasoning models

And so let's dive into it So first we'll get myself out of the way a little Great That's cool So basically how it started off they developed a new set of tests So rather than doing like typical data sets and benchmarks they created their own puzzles The reason that they did this was they thought that models have been have seen those benchmarks so much that they are going to be particularly good at solving them and it's not a reflective it's not reflective of how smart they are You know they've just seen it over and over and over again

Um and so that they will be just trained to do really well on on those benchmarks On top of that the those benchmarks don't allow you to like shift up and down the difficulty of a given problem And you can't like make a math problem harder It is that problem of itself but you can't like twist up the difficulty dial So they created four puzzles You might be familiar with a couple The Tower of Fenoi I think is probably the most popular one where you have all these rings and you need to get them all to the end in some sequential fashion

And you can see these are examples of puzzles that are not very popular so not likely to be like over um exposed in the training data And you can turn up the difficulty easy So instead of having three rungs you could have four five six whatever that might be So again better complexity not a lot of contamination risk um and there's you know algorithmic solutions This is a known answer a known happy bath And so they tested these across you know the big thing they wanted to get to here was okay how good are these these reasoning models actually when we test them on things that aren't benchmarks

Um and so they tested models that had a reasoning and non-reasoning counterpart So in this case it was claw 37 sonnet thinking and then claw 37 sonnet and then they did deepse versus uh deepse v3 And so we'll zoom in a little bit on this

Cool So what you'll see is that like in the beginning even just focusing on this tower of Hanoi for Claude it's kind of a similar shape when you look elsewhere as well but you'll see for the complexity and the number of discs for one two and three it's basically the same and you know that's three different levels that's basically the same level accuracy for both the reasoning and the non-reasoning models Then in the middle here reasoning model So this blue line outperforms the non-reasoning by a fair margin But then interestingly they both collapse essentially at the same point

Um here it's 10 discs for blocks world Again you see similar similar performance a drop where the reasoning model is much better and then they kind of collapse at the same point Some cases like the re non-reasoning is even above similar shape here except you don't have the kind of same performance at the lower complexity but immediate drop at the same level and it's kind of the same across all the graphs So generally at low complexity the non-reasoning models win um in that they are either performing above a same level um or like very very very close and so I think that's a win for the the non-reasoning models That's what you see in this kind of yellow portion here

There's no yellow here I guess in the medium that's where the reasoning models win as the complexity increases and then at the end at the high complexity they both collapse and collapse at the same part And so they deem you know they call this actually estimated me uh the thinking curve So basically at a certain point you see that drop Um and so looking at this each curve shows how many thinking tokens a reasoning model generated before committing to an answer And you'll see at low complexity the lines rise quickly

These chains of plot help the model double check itself Um and yet the to the spent stays high Then it plateaus and then essentially at the high level and we'll we'll look at the graph again after looking at this Um at the high level as the accuracy starts to dive the thinking token uh goes down as well model still has animal context to keep reasoning but it simply produces less reasoning at that really high complexity level and so let's do this again So we'll just focus on this graph

So number of thinking tokens again we see the lines rise quickly So what's interesting is like you'll know if you use reasoning models they will reason about like anything and everything You could tell like 2 plus two there was like a tweet going around that said like hi my name is Sam Alman and it reasoned for you know 03 or whatever when these newer reasoning models reason for like 5 minutes So reasoning shoots up hits a plateau and then it slowly starts to decline And this kind of aligns with where the performance drops off as well to a degree Right as we see this flat line that's where you see the decline to start to happen And we see it really across a bunch of different of the games and with the different uh models as well up plat plateau degrade So it's interesting as the task grows harder reasoning models initially will spend more time thinking um and then while accuracy declines until a point then it just drops off completely Next

interesting aspect of the paper is around where the answers live in the the chain of thought um basically wanted to see okay are the models like getting the right answer and then continuing thinking or are they getting the wrong answer and then get getting to the right answer through through thinking and what we've kind of seen again if you used a reasoning model in chatbt or anywhere else overthinking uneasy you know puzzles or questions um and you could see this by the green mark appearing first here so in Tower of Henoi you'll see the green mark appears first so it should stop but it keeps going um and it even generates some incorrect answers later on in its thought process in the medium Ben um as you know again as it gets more complex and harder first the first ideas are wrong but eventually it gets to the correct ones and that's kind of what you would want from from reasoning It thinks through some options eventually gets to one that's good

and you know beyond a certain point it's just not going to get the the correct answer So this is that that red part of the curve we saw before where the complexity just you know makes the performance go to zero

and then maybe my favorite part was basically so each one of these puzzles has a algorithm you can use to solve it So there's like a happy path there's an answer and in these cases they passed the algorithm in the prompt to solve these problems and so did that lead to better performance no And did it shift that kind of thinking cliff that we saw uh no So basically what we're looking at here is the default So the normal versus when you the algorithm is literally given in the prompt of how to solve the problem It's like the answer key to a degree

You can see it's basically the same in both cases Um the default's like a little bit above here here like you know excuse me the algorithm is above here The default is above here So it's like generally pretty close you could find directions where it either is flipped but performance stays the same which I thought was really really interesting Um

and yeah as I mentioned here the failure mode shifts from searching for a plan to misexecuting a known plan skipping moves replicating them halting early Like these are all the failure modes that were discovered Um but I think this really shows something deeper that there is like something a lot of unknowns about the these reasoning models and the more I read about them the more I'm not super convinced that they perform a lot better on certain tasks It would have been cool to see that actually ran with a non-reasoning model but we don't have that And this kind of hard bins back like this was what was ringing in my head was a recent video we made about um more like alignment Um and so it was about reasoning models don't always say what they think This is an anthropic paper Can open that up So this was a really fun read Um basically what it came down to is anthropic ran a bunch of tests It basically showed that the models are not always faithful to whatever is in their reasoning So they're not always going to say like the answers that are in like their thing process The most like concrete showing of this was they were they gave the model a hint and then see it was like a multiple choice and they gave them a hint maybe that like C was probably the answer and then in their reasoning it didn't show anything about

them using the hint You would expect it to say oh I've got the hint so I'm going to do this and it it didn't do that It kind of hid that to a degree Um so it the model got a hint but doesn't acknowledge So that's like a a faithfulness issue And you know it's hard to even know who these reasoning traces are for and how much they actually help the model get to a final answer how much it is just like burning tokens and whether that actually increases performance And so we actually have another paper that we're going to be discussing soon that touches on all this looking on reasoning models How the increased thinking actually leads to better performance or does having more tokens lead to assumptions that are incorrect leading the model down the wrong direction So wrapping up

on really easy problems and solutions a non-reing model is definitely going to be a better fit when it's in that kind of middle area which of course this is all very context dependent Um but reasoning models are going to perform well They have like a use case for sure Um and then what's interesting is they typically both collapse at the same level that what we call the thinking cliff which is where reasoning tokens um they eventually just drop you know and that aligns with the accuracy dropping um as well It's actually right before so it's like it's not like you know correlation causation is is tough to know here Um so it's something to I would say to run your own tests on even when given the algorithm the reasoning models you know got the same level of performance which I think is really interesting It feels like they just kind of go down these these rabbit holes and they don't take as much of the context but that's hard to tell

The not faithfulness you know just seems really relevant here just because we're talking about the different ways that reasoning models maybe don't perform the way that we're being told I think is really important to note So again maybe budget longer traces for like a medium complexity task and then skip them on easy tasks And don't you know you got to audit this stuff You got to look at the traces Like that will be really helpful um in figuring out what's going on and what's wrong And so that's it for today Reasoning models again the more that info that comes out the less confident I am I am in them We'll have more on this soon and we'll talk to you next time

‍

A new set of tests

The researchers decided to test reasoning and non-reasoning models on a fresh set of four logic puzzles, rather than typical math and coding benchmarks. They did this for two main reasons:

Benchmark leakage hides true capability: On standard sets like AIME 24 and AIME 25, accuracy shifts can be explained as much by training-set exposure as by genuine reasoning skill.
Math problems don’t let you turn the difficulty dial: Established benchmarks usually have a single answer. This makes it impossible to turn up the difficulty to see where performance begins to unravel.

‍

The four puzzles

‍

diagram of puzzles used in the experiments

‍

Tower of Hanoi – move n disks across three pegs without violating size constraints.
Checker Jumping – swap two colored groups of checkers on a linear board using slide and jump moves.
River Crossing – ferry pairs of actors and agents across a river under safety and capacity constraints.
Blocks World – rearrange stacks of labelled blocks into a target pattern while obeying “top-block only” movement.

What these puzzles allow for

Fine-grained complexity control: Each game can easily scale in difficulty.
Minimal contamination risk: Especially compared to popular benchmarks, these exact puzzle instances are much less likely to appear in the pre-training process.
Algorithmic solutions: Each puzzle has a set of rules and a known happy-path.

Performance across the three levels of complexity

Since each puzzle can easily scale in complexity (ex: adding another disk in the Tower of Hanoi puzzle), the researchers test all the models at varying levels of complexity. They test reasoning models and their non-reasoning counterparts. For example, Claude 3.7-Sonnet Thinking and Claude 3.7 Sonnet.

‍

Eight charts comparing accuracy and complexity

‍

The graphs are divided into three colored areas:

Low complexity (yellow) – non-reasoning models win. On easier versions of each puzzle the non-reasoning models win, albeit by a close margin in most cases. Extra reasoning isn’t needed for trivial tasks.
Medium complexity (medium) – reasoning models win. As complexity increases, the reasoning models start to noticeably outperform their non-reasoning counterparts.
High complexity (red) – both collapse. Once the curves enter the right-hand pink region, accuracy for both model types falls to nearly zero.

‍

The Thinking Cliff

Let’s look at the bottom row of four graphs in the graphic below.

‍

Accuracy versus complexity across 8 graphs

Each curve shows how many thinking tokens a reasoning model generated before committing to an answer.

Low complexity: The lines rise quickly, brief chains-of-thought help the model double-check itself, yet the total token spend stays modest.
Medium complexity: Thinking-token counts plateau in the low-thousands, exactly where the accuracy curves (top row) reach their peak.
Approaching high complexity: As soon as those accuracy curves start to dive, the thinking-token lines also start to dive. The model still has ample context to keep reasoning, but it simply stops. That bend is the Thinking Cliff.

As tasks grow harder, reasoning models initially spend more thinking tokens while accuracy declines gradually, until a point where both effort and performance plunge at once.

Where answers live in the chain of thought

The researchers also did a dive into how the model reasons and the content of the reasoning traces themselves.

‍

A variety of graphs showing when thinking models get correct and wrong answers in their reasoning tokens

‍

Over-thinking on easy puzzles: In the less complex cases the green ✓ marks appear first (for Tower of Hanoi), but the model keeps chatting anyway, producing unnecessary reasoning tokens

Late fixes in the medium band: At moderate complexity the pattern reverses: the first ideas are wrong, and the correct line of reasoning appears only near the end of the trace. That’s where the extra tokens finally pay off.

No path on hard puzzles: Beyond the cliff the ✓ marks vanish; the model never reaches a valid solution

Handed the answer, lost in execution

Now for my favorite part of the paper. The researchers re-ran the Tower of Hanoi experiment, but provided the algorithm needed to solve the puzzle directly in the prompt. In theory, the model no longer had to plan, only follow instructions.

Did that lead to better performance? Nope!

Did the collapse in performance shift further along? Nope!

‍

Accuracy versus complexity when algorithm is shared in the prompt

‍

Performance generally stayed the same, and the collapse occurred at roughly the same point.

The failure mode shifts from searching for a plan to mis-executing a known plan, skipping moves, duplicating them, or halting early.

Alignment lens, when “thinking” hides shortcuts

It gets worse for reasoning models.

Not only do they “give up” on very hard puzzles, collapsing alongside non-reasoning baselines, but their reasoning itself can hide how they actually solve problems. Their final answer is not always faithful to the reasoning tokens used.

We covered this in a recent video, but Anthropic recently released a paper Reasoning Models Don’t Always Say What They Think that put this to the test.

The researchers slipped subtle hints into multiple-choice questions. Models used the hints to get the right answers, yet referenced them in fewer than 20 percent of their reasoning traces. The model found a shortcut (the hint), but doesn’t acknowledge it in its reasoning.

So between the two papers, we have two complementary failure modes

Competence collapse (Apple paper) – accuracy nosedives once task depth crosses the Thinking Cliff.
Faithfulness collapse (Anthropic paper) – accuracy stays high, but the stated reasoning omits the true rationale, making alignment and evaluation unreliable.

Both results underscore a common lesson: final answers are not enough.

Wrapping up

Whether you’re shipping LLM-based features to production or just casually using chatbots, keep these points in mind:

Three-zone curve: Non-reasoning models win on easy puzzles, reasoning models take the lead at medium complexity, and both collapse once tasks cross some level of difficulty (the Thinking Cliff).
Thinking Cliff: Reasoning-token counts rise, then drop, just before accuracy plunges. Additional reasoning tokens stop helping and the model effectively gives up.
Execution still fails: Even when given the optimal Tower-of-Hanoi algorithm, reasoning models perform at the same level and fail at the same level of complexity.
Not faithful: Anthropic shows models can use hints to get the right answer yet omit them from their chain-of-thought.
Spend CoT where it pays off: Budget longer traces for medium-complexity tasks. Skip them on easy tasks.
Log and audit reasoning traces: Don’t just look at final answers!

‍

Dan Cleary

Founder

When ‘Thinking’ Models Stop Thinking

A new set of tests

The four puzzles

What these puzzles allow for

Performance across the three levels of complexity

The Thinking Cliff

Where answers live in the chain of thought

Handed the answer, lost in execution

Alignment lens, when “thinking” hides shortcuts

Wrapping up

Get the week's best prompt engineering and AI content

Join thousands of AI builders

More from the PromptHub Blog

Feature Launch: Pipelines

How to Automatically Pick the Right Model for the Right Job

Why LLMs Fail in Multi-Turn Conversations (And How to Fix It)