A recent Apple research paper details a technique that speeds up large language model responses while preserving output quality. Here are the specifics.
## The technical details
Traditionally, LLMs generate text one token at a time. This makes the process slow, since each step depends on everything that came before it to keep the output coherent and accurate.
For instance, if the model is writing a sentence like “The cat is black,” it predicts each token in sequence. After producing “The cat is,” it looks at everything generated so far (plus the user’s prompt and what it learned during training) to estimate how likely every token in its vocabulary is to come next. This is known as autoregression.
In this case, it might weigh options like black, tall, sleeping, grumpy, fluffy, skinny, purring, white, tired, playing, missing, meowing, cold, and so on, then pick the one that best fits the context.
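To make that concrete, here is a minimal sketch of autoregressive decoding. The `toy_next_token_probs` function is a hypothetical stand-in for a real model’s forward pass; the point is that every new token requires re-reading the entire prefix and keeping only the single most likely continuation.

```python
# A minimal sketch of autoregressive (one-token-at-a-time) decoding.
# `toy_next_token_probs` is a hypothetical stand-in for a real LLM forward pass.

VOCAB = ["The", "cat", "is", "black", "fluffy", "sleeping", "."]

def toy_next_token_probs(prefix: list[str]) -> dict[str, float]:
    """Return a probability for every vocabulary token, given the full prefix."""
    if prefix == ["The", "cat", "is"]:
        return {"black": 0.4, "fluffy": 0.3, "sleeping": 0.2, ".": 0.1}
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def generate(prefix: list[str], max_new_tokens: int) -> list[str]:
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)    # every step re-reads the whole prefix
        next_token = max(probs, key=probs.get)  # greedy: keep the single most likely token
        tokens.append(next_token)
    return tokens

print(generate(["The", "cat", "is"], max_new_tokens=1))
# -> ['The', 'cat', 'is', 'black']
```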
## What Apple accomplished
In the paper [Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential](https://arxiv.org/abs/2507.11851), Apple’s researchers found that although these models are trained to predict only the next token, they still carry useful information about several upcoming tokens.
Building on this insight, they created a “multi-token prediction” (MTP) framework that enables the model to generate multiple tokens simultaneously.
If this sounds reminiscent of the [diffusion model study](https://9to5mac.com/2025/07/04/apple-just-released-a-weirdly-interesting-coding-language-model/) we covered a few weeks ago, you’re not wrong. While the training methods and underlying technology differ, both approaches aim to speed up inference by reaching the result faster than strictly one-token-at-a-time generation.
In this study, the researchers insert special “mask” tokens into the prompt, which act as placeholders for upcoming words.
For example, “The cat is ” could be completed as “very fluffy” in a single step. During generation, the model speculates on several upcoming words at once, and each one is immediately checked against what standard autoregressive decoding would have produced. If a prediction fails verification, the model falls back to the usual one-token-at-a-time approach, as shown in the sketch below. The result is a speedup without a loss of accuracy.
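Here is a simplified sketch of that draft-then-verify loop, not Apple’s exact implementation: the hypothetical `draft_k_tokens` stands in for the multi-token head that fills the mask placeholders, and `autoregressive_next` stands in for ordinary decoding, which serves as the reference during verification. In the real system the verification of all drafted tokens can be batched into a single forward pass, which is where the speed gain comes from; the toy loop below only shows the accept-or-fall-back logic.

```python
# A simplified sketch of draft-then-verify decoding (illustrative, not Apple's code).
# `draft_k_tokens` is a hypothetical stand-in for the multi-token head;
# `autoregressive_next` stands in for ordinary one-step decoding used as the reference.

def autoregressive_next(prefix: list[str]) -> str:
    """Hypothetical one-step decoder used as the reference answer."""
    continuations = {
        ("The", "cat", "is"): "very",
        ("The", "cat", "is", "very"): "fluffy",
    }
    return continuations.get(tuple(prefix), ".")

def draft_k_tokens(prefix: list[str], k: int) -> list[str]:
    """Hypothetical multi-token head: proposes k future tokens in one shot."""
    return ["very", "fluffy", "today"][:k]

def speculate_step(prefix: list[str], k: int) -> list[str]:
    accepted: list[str] = []
    for token in draft_k_tokens(prefix, k):
        expected = autoregressive_next(prefix + accepted)
        if token == expected:
            accepted.append(token)      # draft verified: kept at no extra cost
        else:
            accepted.append(expected)   # mismatch: fall back to the standard choice
            break
    return accepted

print(speculate_step(["The", "cat", "is"], k=3))
# -> ['very', 'fluffy', '.']  ("today" failed verification, so the standard token wins)
```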
In tests with the open-source Tulu3-8B model, which Apple trained to speculatively predict 8 additional tokens, the team measured average speedups of 2–3× on general tasks such as Q&A and chat, and up to 5× on more predictable domains like coding and math. These gains came with “no degradation in generation quality, thanks to a straightforward yet efficient technique we name gated LoRA adaptation.”
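The article’s quote doesn’t spell out the mechanics, but one plausible reading of “gated LoRA adaptation” is a low-rank LoRA update that is switched on only for the extra speculative positions, leaving the base model’s ordinary next-token path untouched. The sketch below illustrates that idea; the names, shapes, and gating scheme are assumptions for illustration, not the paper’s code.

```python
# A hedged sketch of what a "gated" LoRA layer might look like, assuming the gate
# zeroes the low-rank update on ordinary tokens and enables it only on the extra
# speculative (mask) positions. Names, shapes, and the gating scheme are
# illustrative assumptions, not Apple's published implementation.
import numpy as np

def gated_lora_linear(x, W, A, B, gate):
    """
    x:    (seq_len, d_in)  token activations
    W:    (d_in, d_out)    frozen base weight
    A:    (d_in, r)        LoRA down-projection (trainable)
    B:    (r, d_out)       LoRA up-projection (trainable)
    gate: (seq_len, 1)     1.0 on speculative/mask positions, 0.0 elsewhere
    """
    base = x @ W               # frozen path, identical to the original model
    lora = (x @ A) @ B         # low-rank correction learned for multi-token prediction
    return base + gate * lora  # gate turns the correction off on normal tokens

# Toy usage: 4 positions, the last 2 are speculative mask placeholders.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 8))
A, B = rng.standard_normal((8, 2)), rng.standard_normal((2, 8))
gate = np.array([[0.0], [0.0], [1.0], [1.0]])
out = gated_lora_linear(x, W, A, B, gate)
# Rows 0-1 of `out` equal x @ W exactly, so ordinary next-token behavior is unchanged.
```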
You can access the complete paper on [arXiv](https://arxiv.org/abs/2507.11851).