Accelerating Gemini Nano models on Pixel with frozen Multi-Token Prediction
Having powerful Large Language Models (LLMs) right in your pocket is now a reality with on-device models like Gemini Nano and Gemma. This technology enables everyday features on your phone — such as instantly summarizing a flurry of notifications or proofreading an important text message — all without sending your private data off device. But to make these features useful for everyday users, they need to happen very efficiently.
Delivering this kind of speed on a mobile device is a significant challenge. Unlike vast server environments, mobile phones operate under a strict energy budget and hard memory (RAM) limits. Furthermore, standard language models generate text “autoregressively” — meaning they process and output just one word (or token) at a time. This step-by-step process creates a bottleneck, underutilizing the phone’s processing power while straining its memory bandwidth, which can ultimately slow down the user experience and drain the battery.
To overcome this bottleneck, we are announcing a new architecture that retrofits Multi-Token Prediction (MTP) onto existing, “frozen” Gemini Nano v3 models. Building on prior approaches like the EAGLE framework and Confident Adaptive Language Modeling (CALM), we designed new architectural components to maximize these efficiency gains specifically for mobile environments. Our recent announcements highlighted accelerating Gemma 4 with MTP and making it available to developers.
Today’s article tackles the unique, extreme constraints of edge computing. Recently rolled out to the Pixel 9 and 10 series, this approach acts as an out-of-the-box speedup. For users, this means that features like AI Notification Summaries and Proofread generate text significantly faster and with less energy consumption. For developers, it eliminates a major friction point: delivering high-speed on-device AI without the need to fine-tune separate, memory-heavy drafting models for every new task.

