Apple may have just made their biggest splash in AI and it has nothing to do with any software. Their recent paper, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity” has been making waves because it delivers a blunt critique of today’s reasoning models.
The authors argue that the reason these models are so great on benchmarks comes from over-exposure to those benchmarks, not from the “reasoning”.
The TL;DR of their findings:
- Non-reasoning models beat reasoning models on low-complexity tasks
- Reasoning models beat non-reasoning models on medium-complexity tasks
- Performance of both model types collapse at the same level of complexity
But there is so much more to it! Let’s jump in.
A new set of tests
The researchers decided to test reasoning and non-reasoning models on a fresh set of four logic puzzles, rather than typical math and coding benchmarks. They did this for two main reasons:
- Benchmark leakage hides true capability: On standard sets like AIME 24 and AIME 25, accuracy shifts can be explained as much by training-set exposure as by genuine reasoning skill.
- Math problems don’t let you turn the difficulty dial: Established benchmarks usually have a single answer. This makes it impossible to turn up the difficulty to see where performance begins to unravel.
The four puzzles

- Tower of Hanoi – move n disks across three pegs without violating size constraints.
- Checker Jumping – swap two colored groups of checkers on a linear board using slide and jump moves.
- River Crossing – ferry pairs of actors and agents across a river under safety and capacity constraints.
- Blocks World – rearrange stacks of labelled blocks into a target pattern while obeying “top-block only” movement.
What these puzzles allow for
- Fine-grained complexity control: Each game can easily scale in difficulty.
- Minimal contamination risk: Especially compared to popular benchmarks, these exact puzzle instances are much less likely to appear in the pre-training process.
- Algorithmic solutions: Each puzzle has a set of rules and a known happy-path.
Performance across the three levels of complexity
Since each puzzle can easily scale in complexity (ex: adding another disk in the Tower of Hanoi puzzle), the researchers test all the models at varying levels of complexity. They test reasoning models and their non-reasoning counterparts. For example, Claude 3.7-Sonnet Thinking and Claude 3.7 Sonnet.

The graphs are divided into three colored areas:
- Low complexity (yellow) – non-reasoning models win. On easier versions of each puzzle the non-reasoning models win, albeit by a close margin in most cases. Extra reasoning isn’t needed for trivial tasks.
- Medium complexity (medium) – reasoning models win. As complexity increases, the reasoning models start to noticeably outperform their non-reasoning counterparts.
- High complexity (red) – both collapse. Once the curves enter the right-hand pink region, accuracy for both model types falls to nearly zero.
The Thinking Cliff
Let’s look at the bottom row of four graphs in the graphic below.

Each curve shows how many thinking tokens a reasoning model generated before committing to an answer.
- Low complexity: The lines rise quickly, brief chains-of-thought help the model double-check itself, yet the total token spend stays modest.
- Medium complexity: Thinking-token counts plateau in the low-thousands, exactly where the accuracy curves (top row) reach their peak.
- Approaching high complexity: As soon as those accuracy curves start to dive, the thinking-token lines kink sharply downward. The model still has ample context to keep reasoning, but it simply stops. That bend is the Thinking Cliff.
As tasks grow harder, reasoning models initially spend more thinking tokens while accuracy declines gradually, until a point where both effort and performance plunge at once.
Where answers live in the chain of thought
The researchers also did a dive into how the model reasons and the content of the reasoning traces themselves.

- Over-thinking on easy puzzles: In the less complex cases the green ✓ marks appear first (for Tower of Hanoi), but the model keeps chatting anyway, producing unnecessary reasoning tokens
- Late fixes in the medium band: At moderate complexity the pattern reverses: the first ideas are wrong, and the correct line of reasoning appears only near the end of the trace. That’s where the extra tokens finally pay off.
- No path on hard puzzles: Beyond the cliff the ✓ marks vanish; the model never reaches a valid solution
Handed the answer, lost in execution
Now for my favorite part of the paper. The researchers re-ran the Tower of Hanoi experiment, but provided the algorithm needed to solve the puzzle directly in the prompt. In theory, the model no longer had to plan, only follow instructions.
Did that lead to better performance? Nope!
Did the collapse in performance shift further along? Nope!

Performance generally stayed the same, and the collapse occurred at roughly the same point.
The failure mode shifts from searching for a plan to mis-executing a known plan, skipping moves, duplicating them, or halting early.
Alignment lens, when “thinking” hides shortcuts
It gets worse for reasoning models.
Not only do they “give up” on very hard puzzles, collapsing alongside non-reasoning baselines, but their reasoning itself can hide how they actually solve problems. Their final answer is not always faithful to the reasoning tokens used.
We covered this in a recent video, but Anthropic recently released a paper *Reasoning Models Don’t Always Say What They Think* that put this to the test.
The researchers slipped subtle hints into multiple-choice questions. Models used the hints to get the right answers, yet referenced them in fewer than 20 percent of their reasoning traces. The model found a shortcut (the hint), but doesn’t acknowledge it in its reasoning.
So between the two papers, we have two complementary failure modes
- Competence collapse (Apple paper) – accuracy nosedives once task depth crosses the Thinking Cliff.
- Faithfulness collapse (Anthropic paper) – accuracy stays high, but the stated reasoning omits the true rationale, making alignment and evaluation unreliable.
Both results underscore a common lesson: final answers are not enough.
Wrapping up
Whether you’re shipping LLM-based features to production or just casually using chatbots, keep these points in mind:
- Three-zone curve: Non-reasoning models win on easy puzzles, reasoning models take the lead at medium complexity, and both collapse once tasks cross some level of difficulty (the Thinking Cliff).
- Thinking Cliff: Reasoning-token counts rise, then drop, just before accuracy plunges. Additional reasoning tokens stop helping and the model effectively gives up.
- Execution still fails: Even when given the optimal Tower-of-Hanoi algorithm, reasoning models perform at the same level and fail at the same level of complexity.
- Not faithful: Anthropic shows models can use hints to get the right answer yet omit them from their chain-of-thought.
- Spend CoT where it pays off: Budget longer traces for medium-complexity tasks. Skip them on easy tasks.
- Log and audit reasoning traces: Don’t just look at final answers!