Apple may have just made their biggest splash in AI and it has nothing to do with any software. Their recent paper, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity” has been making waves because it delivers a blunt critique of today’s reasoning models.

The authors argue that the reason these models are so great on benchmarks comes from over-exposure to those benchmarks, not from the “reasoning”.

The TL;DR of their findings:

  • Non-reasoning models beat reasoning models on low-complexity tasks
  • Reasoning models beat non-reasoning models on medium-complexity tasks
  • Performance of both model types collapse at the same level of complexity

But there is so much more to it! Let’s jump in.

A new set of tests

The researchers decided to test reasoning and non-reasoning models on a fresh set of four logic puzzles, rather than typical math and coding benchmarks. They did this for two main reasons:

  • Benchmark leakage hides true capability: On standard sets like AIME 24 and AIME 25, accuracy shifts can be explained as much by training-set exposure as by genuine reasoning skill.
  • Math problems don’t let you turn the difficulty dial: Established benchmarks usually have a single answer. This makes it impossible to turn up the difficulty to see where performance begins to unravel.

The four puzzles

diagram of puzzles used in the experiments

  • Tower of Hanoi – move n disks across three pegs without violating size constraints.
  • Checker Jumping – swap two colored groups of checkers on a linear board using slide and jump moves.
  • River Crossing – ferry pairs of actors and agents across a river under safety and capacity constraints.
  • Blocks World – rearrange stacks of labelled blocks into a target pattern while obeying “top-block only” movement.

What these puzzles allow for

  1. Fine-grained complexity control: Each game can easily scale in difficulty.
  2. Minimal contamination risk: Especially compared to popular benchmarks, these exact puzzle instances are much less likely to appear in the pre-training process.
  3. Algorithmic solutions: Each puzzle has a set of rules and a known happy-path.

Performance across the three levels of complexity

Since each puzzle can easily scale in complexity (ex: adding another disk in the Tower of Hanoi puzzle), the researchers test all the models at varying levels of complexity. They test reasoning models and their non-reasoning counterparts. For example, Claude 3.7-Sonnet Thinking and Claude 3.7 Sonnet.

Eight charts comparing accuracy and complexity

The graphs are divided into three colored areas:

  1. Low complexity (yellow) – non-reasoning models win. On easier versions of each puzzle the non-reasoning models win, albeit by a close margin in most cases. Extra reasoning isn’t needed for trivial tasks.
  2. Medium complexity (medium) – reasoning models win. As complexity increases, the reasoning models start to noticeably outperform their non-reasoning counterparts.
  3. High complexity (red) – both collapse. Once the curves enter the right-hand pink region, accuracy for both model types falls to nearly zero.

The Thinking Cliff

Let’s look at the bottom row of four graphs in the graphic below.

Accuracy versus complexity across 8 graphs

Each curve shows how many thinking tokens a reasoning model generated before committing to an answer.

  • Low complexity: The lines rise quickly, brief chains-of-thought help the model double-check itself, yet the total token spend stays modest.
  • Medium complexity: Thinking-token counts plateau in the low-thousands, exactly where the accuracy curves (top row) reach their peak.
  • Approaching high complexity: As soon as those accuracy curves start to dive, the thinking-token lines kink sharply downward. The model still has ample context to keep reasoning, but it simply stops. That bend is the Thinking Cliff.

As tasks grow harder, reasoning models initially spend more thinking tokens while accuracy declines gradually, until a point where both effort and performance plunge at once.

Where answers live in the chain of thought

The researchers also did a dive into how the model reasons and the content of the reasoning traces themselves.

A variety of graphs showing when thinking models get correct and wrong answers in their reasoning tokens

  • Over-thinking on easy puzzles: In the less complex cases the green ✓ marks appear first (for Tower of Hanoi), but the model keeps chatting anyway, producing unnecessary reasoning tokens
  • Late fixes in the medium band: At moderate complexity the pattern reverses: the first ideas are wrong, and the correct line of reasoning appears only near the end of the trace. That’s where the extra tokens finally pay off.
  • No path on hard puzzles: Beyond the cliff the ✓ marks vanish; the model never reaches a valid solution

Handed the answer, lost in execution

Now for my favorite part of the paper. The researchers re-ran the Tower of Hanoi experiment, but provided the algorithm needed to solve the puzzle directly in the prompt.  In theory, the model no longer had to plan, only follow instructions.

Did that lead to better performance? Nope!

Did the collapse in performance shift further along? Nope!

Accuracy versus complexity when algorithm is shared in the prompt

Performance generally stayed the same, and the collapse occurred at roughly the same point.

The failure mode shifts from searching for a plan to mis-executing a known plan, skipping moves, duplicating them, or halting early.

Alignment lens, when “thinking” hides shortcuts

It gets worse for reasoning models.

Not only do they “give up” on very hard puzzles, collapsing alongside non-reasoning baselines, but their reasoning itself can hide how they actually solve problems. Their final answer is not always faithful to the reasoning tokens used.

We covered this in a recent video, but Anthropic recently released a paper *Reasoning Models Don’t Always Say What They Think* that put this to the test.

The researchers slipped subtle hints into multiple-choice questions. Models used the hints to get the right answers, yet referenced them in fewer than 20 percent of their reasoning traces. The model found a shortcut (the hint), but doesn’t acknowledge it in its reasoning.

So between the two papers, we have two complementary failure modes

  • Competence collapse (Apple paper) – accuracy nosedives once task depth crosses the Thinking Cliff.
  • Faithfulness collapse (Anthropic paper) – accuracy stays high, but the stated reasoning omits the true rationale, making alignment and evaluation unreliable.

Both results underscore a common lesson: final answers are not enough.

Wrapping up

Whether you’re shipping LLM-based features to production or just casually using chatbots, keep these points in mind:

  • Three-zone curve: Non-reasoning models win on easy puzzles, reasoning models take the lead at medium complexity, and both collapse once tasks cross some level of difficulty (the Thinking Cliff).
  • Thinking Cliff: Reasoning-token counts rise, then drop, just before accuracy plunges. Additional reasoning tokens stop helping and the model effectively gives up.
  • Execution still fails: Even when given the optimal Tower-of-Hanoi algorithm, reasoning models perform at the same level and fail at the same level of complexity.
  • Not faithful: Anthropic shows models can use hints to get the right answer yet omit them from their chain-of-thought.
  • Spend CoT where it pays off: Budget longer traces for medium-complexity tasks. Skip them on easy tasks.
  • Log and audit reasoning traces: Don’t just look at final answers!

Headshot of PromptHub Co-Founder Dan Cleary
Dan Cleary
Founder