# Research by Apple Indicates ChatGPT and Comparable Chatbots Do Not Possess Genuine Reasoning Skills

## Is AI Capable of Real Reasoning? New Research from Apple Questions Generative AI’s Abilities

In recent months, companies such as OpenAI and Google have been pushing generative AI forward, suggesting that a major breakthrough in AI capability is just around the corner. Take OpenAI's **ChatGPT o1-preview** model, which is pitched as a glimpse of AI's future because it builds in reasoning skills. Available to ChatGPT Plus and other paying users, it is claimed to work through intricate queries that demand multi-step reasoning, making it better suited to complicated problems.

However, a recent study by **Apple's AI research team** casts doubt on these claims. Their findings suggest that generative AI models, including ChatGPT o1 and others, may not genuinely "reason" the way humans do. Instead, they rely on pattern matching over their training data, which can create the illusion of reasoning but breaks down when they are confronted with certain kinds of problems.

### The Essence of Apple’s Study: Is AI Truly Capable of Reasoning?

Apple's research, published as a preprint on [arXiv](https://arxiv.org/pdf/2410.05229), investigates whether large language models (LLMs) such as GPT-4o and the models behind ChatGPT can genuinely reason through problems, or whether they merely reproduce reasoning-like behavior based on patterns absorbed during training.

The researchers ran a battery of tests on both open-source models, such as **Llama**, **Phi**, **Gemma**, and **Mistral**, and proprietary ones, including **ChatGPT o1-preview**, **o1-mini**, and **GPT-4o**. The consistent finding was that these models do not genuinely reason. Instead, they try to imitate the reasoning steps seen in their training data, which leads to mistakes when even slight changes to a problem's structure are introduced.

### The GSM-Symbolic Benchmark: An Innovative Assessment for AI Reasoning

To evaluate the reasoning abilities of these models, Apple's researchers built a modified version of the **GSM8K** benchmark, a collection of more than 8,000 grade-school math word problems commonly used to assess AI models. Their variant, called **GSM-Symbolic**, makes superficial changes to the problems, such as swapping character names, relationships, or numerical values, none of which should affect the logic of the problem.

For instance, in one problem, the character "Sophie" must count toys. Renaming her or changing the numbers should not confuse a model that truly reasons; after all, a primary school student would still solve the problem despite such trivial adjustments.
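
To make the perturbation concrete, here is a minimal Python sketch of the idea, assuming a simple toy-counting template; the template text, the names, and the `make_variant` helper are invented for illustration and are not taken from Apple's benchmark.

```python
import random

# Turn one GSM8K-style word problem into a template, then sample
# surface details (name, numbers) to produce logically equivalent
# variants, in the spirit of GSM-Symbolic. All values are hypothetical.
TEMPLATE = (
    "{name} has {total} toys and gives {given} of them to a friend. "
    "How many toys does {name} have left?"
)
NAMES = ["Sophie", "Liam", "Aisha", "Mateo"]

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Sample one surface-level variant and its ground-truth answer."""
    total = rng.randint(10, 50)
    given = rng.randint(1, total - 1)  # keep the answer positive
    question = TEMPLATE.format(name=rng.choice(NAMES), total=total, given=given)
    return question, total - given     # the underlying logic never changes

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```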

However, the study showed that even these seemingly insignificant changes had a measurable effect: accuracy fell by as much as **10%** on some models when they faced the GSM-Symbolic test. Other models held up better. **GPT-4o**, for example, slipped only slightly, from 95.2% on the original GSM8K test to 94.9% on the GSM-Symbolic version.

### The No-Op Test: A Closer Examination of AI’s Shortcomings

Apple's researchers went beyond surface-level tweaks. They also introduced a test named **GSM-NoOp**, in which irrelevant clauses were added to the math problems: statements that sound related but contribute nothing to the solution. The goal was to see whether the models could ignore these distractions and focus on the core reasoning needed to solve the problem.
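
Here is a hedged Python sketch of that kind of "no-op" perturbation; the problem and the distractor clause are invented for illustration rather than drawn from the actual benchmark.

```python
# GSM-NoOp-style perturbation: splice in a clause that sounds relevant
# but has no bearing on the answer. The distractor below is invented
# for illustration; it is not taken from Apple's benchmark.
base_problem = (
    "Sophie has 24 toys. She gives 5 toys to a friend. "
    "How many toys does Sophie have left?"
)
distractor = "Five of the toys she keeps are slightly smaller than average."

# Insert the no-op clause just before the question.
noop_problem = base_problem.replace("How many", distractor + " How many")
print(noop_problem)

# A genuine reasoner still answers 19; a pattern-matcher may be tempted
# to also subtract the extra "five" mentioned in the distractor.
```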

The outcomes were revealing. Many models struggled with these extraneous additions, with accuracy dropping by as much as 65% in some cases, which further indicates that they do not truly grasp the underlying logic of the problems they solve. Instead, they appear to follow patterns learned during training, and when those patterns are disturbed, performance falls apart.
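
As a rough illustration of how that fragility could be quantified, the sketch below compares accuracy on baseline problems against their no-op counterparts; the model answers here are made-up stand-ins, not results from the study.

```python
# Compare a (hypothetical) model's accuracy on baseline problems vs.
# their no-op variants. All answers below are invented stand-ins used
# only to show the measurement, not real model output.
def accuracy(predictions: list[int], gold: list[int]) -> float:
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

gold = [19, 19, 19]            # same logic, same answers
baseline_preds = [19, 19, 19]  # original phrasing: all correct
noop_preds = [14, 19, 14]      # distracted by the logically inert clause

drop = accuracy(baseline_preds, gold) - accuracy(noop_preds, gold)
print(f"Accuracy drop caused by no-op clauses: {drop:.0%}")
```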

### Implications for the Future of AI

Apple's findings raise important questions about the future of generative AI and its capacity for tasks that demand genuine reasoning. Models like ChatGPT o1-preview are undeniably powerful and can produce impressive results in many scenarios, but their reliance on pattern recognition rather than true reasoning could limit their usefulness in more complex, real-world applications.

This does not mean generative AI lacks value; quite the opposite. These models remain enormously useful for a range of applications, from content creation to customer support. But recognizing their limitations is essential, and so is the continued push to expand what AI can do.

### Final Remarks: The Future Path for AI Reasoning

As companies such as OpenAI, Google, and Apple continue to refine their AI technologies, the question of whether AI can truly reason remains open. Apple's study suggests we are not quite there yet: current models excel at pattern recognition but fall short of genuine logical reasoning.

At present, it appears that AI’s reasoning capabilities are still developing.