New research from Apple indicates that advanced AI models, particularly large reasoning models (LRMs), suffer a "complete accuracy collapse" when confronted with highly complex problems. The finding challenges prevailing assumptions about AI capabilities and raises significant questions about the industry's pursuit of artificial general intelligence (AGI).
AI's Reasoning Limitations Exposed
Apple's study, published on its Machine Learning Research website, reveals that while LRMs attempt to solve complex queries by breaking them into smaller steps, their performance declines sharply as problem complexity increases. Standard AI models even outperformed LRMs on low-complexity tasks. The researchers also found that, as LRMs approached the point of collapse, they counterintuitively reduced their "reasoning effort", a result described as "particularly concerning".
The Illusion of General Intelligence
Gary Marcus, a prominent AI academic, described the Apple paper as "pretty devastating", suggesting it casts doubt on the rapid pursuit of AGI. He argues that large language models (LLMs), which underpin tools like ChatGPT, are not a direct route to an AGI that could fundamentally transform society. The study highlights that current AI approaches may have reached fundamental barriers to generalisable reasoning, the ability of a model to take a conclusion reached in a narrow context and apply it more broadly.
Key Takeaways
- Advanced AI models, including large reasoning models (LRMs), experience a "complete accuracy collapse" when faced with highly complex problems.
- LRMs surprisingly reduce their reasoning effort as problems become more difficult, indicating a fundamental scaling limitation.
- The findings challenge the notion that current LLMs are a direct path to artificial general intelligence (AGI).
- The study suggests that the AI industry's current approach may be in a "cul-de-sac" regarding generalisable reasoning.
Hallucinations and Computational Waste
The research also found that reasoning models waste computing power: on simpler problems they often find the correct solution early in their reasoning yet keep exploring incorrect alternatives, while on slightly more complex tasks they pursue wrong solutions before arriving at the right answer. For highly complex problems, the models failed entirely, even when provided with an algorithm that solves the puzzle. This aligns with observations that reasoning models are more prone to "hallucinations" (erroneous or nonsensical responses) than their generic counterparts, a problem that appears to worsen as models advance.
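For context on what "provided with a solution algorithm" can mean in the Tower of Hanoi case, the standard recursive procedure is only a few lines long; the sketch below is illustrative Python, not Apple's actual prompt or evaluation code.

```python
def hanoi(n, source, target, spare, moves):
    """Append the optimal move sequence for n discs from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # move the n-1 smaller discs out of the way
    moves.append((source, target))              # move the largest remaining disc
    hanoi(n - 1, spare, target, source, moves)  # restack the smaller discs on top

moves = []
hanoi(3, "A", "C", "B", moves)
print(moves)       # 7 moves for 3 discs
print(len(moves))  # always 2**n - 1, the provable minimum
```

Even with a correct procedure like this available, the study reports that models still failed to execute the required move sequences once the puzzles became sufficiently large.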
Puzzles and Performance
To assess the models, researchers used classic puzzles such as the Tower of Hanoi and River Crossing, scaling their complexity, for example by increasing the number of discs to be moved. While generic models had an edge on low-complexity tasks and reasoning models gained an advantage at medium complexity, both types saw their performance "collapse to zero" on high-complexity puzzles. This suggests that current AI models rely more on pattern recognition than on emergent logic, raising questions about their true understanding and reasoning capabilities.
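As a rough illustration of how quickly that kind of complexity escalates, the minimum solution length for the Tower of Hanoi grows exponentially with the number of discs; the snippet below simply prints that growth and makes no claim about the specific disc counts Apple tested.

```python
# A correct Tower of Hanoi solution for n discs requires 2**n - 1 moves,
# so each extra disc roughly doubles the length of a flawless answer.
for n in range(3, 13):
    print(f"{n:2d} discs -> {2**n - 1:5d} moves minimum")
```

A model that pattern-matches on familiar, short move sequences can look competent at low disc counts and still break down completely once a flawless sequence of hundreds of moves is required.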