Apple Researchers Develop On-Device AI Assistant for Application Engagement

**Ferret-UI Lite: A Milestone in On-Device GUI Engagement**

With only 3 billion parameters, Ferret-UI Lite matches or exceeds the benchmark performance of models up to 24 times its size. Here are the details.

### A Brief Background on Ferret

In December 2023, a team of nine researchers released a paper titled “FERRET: Refer and Ground Anything Anywhere at Any Granularity.” In it, they introduced a multimodal large language model (MLLM) capable of understanding natural language references to specific regions of an image. Since then, Apple has published follow-up papers that expand the Ferret model family, including Ferret v2, Ferret-UI, and Ferret-UI 2.

The Ferret-UI variants, in particular, built on the original Ferret's capabilities to address the limitations of general-domain MLLMs. The foundational Ferret-UI paper emphasized that despite significant progress, MLLMs frequently struggle to understand and interact effectively with user interface (UI) screens. Ferret-UI was designed to better interpret mobile UI screens, incorporating an “any resolution” approach to magnify details and leverage enhanced visual features.

### Ferret-UI Lite

Recently, Apple extended the Ferret-UI lineup with a paper titled “Ferret-UI Lite: Insights from Developing Compact On-Device GUI Agents.” The original Ferret-UI was built on a 13B-parameter model focused on mobile UI understanding over fixed-resolution screenshots. Ferret-UI Lite, by contrast, is a compact model designed to run on-device while competing with much larger GUI agents.

The researchers observed that conventional GUI agent approaches often rely on large foundation models for their strong reasoning and planning abilities. However, such models are typically too large and resource-intensive for practical on-device deployment. Consequently, they built Ferret-UI Lite, a 3-billion-parameter variant whose key components draw on lessons from training compact language models.

Ferret-UI Lite harnesses:

– Real and synthetic training datasets spanning multiple GUI domains.
– On-the-fly cropping and zooming to sharpen understanding of specific GUI regions.
– Supervised fine-tuning followed by reinforcement learning; a sketch of one possible reward signal follows this list.
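
The article doesn't spell out the paper's reward design, but a common reinforcement-learning signal for GUI grounding is whether a predicted click lands inside the target element's bounding box. Here is a minimal Python sketch of that idea; the function name and reward scheme are assumptions, not the paper's published recipe:

```python
# Hypothetical sketch: a binary grounding reward for RL fine-tuning.
# The exact reward design here is an assumption, not the paper's recipe.

def grounding_reward(pred_xy: tuple[float, float],
                     target_box: tuple[float, float, float, float]) -> float:
    """Return 1.0 if the predicted click point lands inside the target
    element's bounding box (x1, y1, x2, y2), else 0.0."""
    x, y = pred_xy
    x1, y1, x2, y2 = target_box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

# Example: a click at (120, 340) on a button spanning (100, 320)-(200, 360).
assert grounding_reward((120, 340), (100, 320, 200, 360)) == 1.0
```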

The result is a model that matches or even surpasses rival GUI agent models with up to 24 times its parameter count.

### Innovative Techniques

Ferret-UI Lite's inference pipeline includes on-the-fly cropping and zooming. The model makes an initial prediction, crops the screenshot around that prediction, and then re-predicts within the cropped region, compensating for its limited capacity to process large numbers of image tokens.
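
A minimal sketch of how that coarse-to-fine "predict, crop, re-predict" loop might look in Python; `model.predict_point` and its signature are hypothetical stand-ins, not the paper's actual API:

```python
# Sketch of two-stage grounding: coarse prediction, then a zoomed re-prediction.
from PIL import Image

def two_stage_ground(model, screenshot: Image.Image, query: str,
                     crop_size: int = 512) -> tuple[int, int]:
    # Stage 1: coarse prediction on the full screenshot.
    x, y = model.predict_point(screenshot, query)

    # Stage 2: crop a window around the first guess and re-predict,
    # so the same image-token budget covers a smaller, more detailed region.
    left = max(0, min(int(x) - crop_size // 2, screenshot.width - crop_size))
    top = max(0, min(int(y) - crop_size // 2, screenshot.height - crop_size))
    zoomed = screenshot.crop((left, top, left + crop_size, top + crop_size))
    rx, ry = model.predict_point(zoomed, query)

    # Map the refined local coordinates back to full-screen space.
    return left + int(rx), top + int(ry)
```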

Another notable technique is generating its own training data. The researchers built a multi-agent system that interacts directly with live GUI environments to produce synthetic training examples at scale. This captures the messiness of real-world interaction, including mistakes and unexpected states, which is hard to replicate with pristine, human-annotated data.
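
At its core, such a pipeline rolls an agent through live tasks and logs every step, successful or not. A simplified sketch, where the `env` and `agent` interfaces are hypothetical and the paper's multi-agent system is considerably more elaborate:

```python
# Illustrative sketch of harvesting synthetic trajectories from a live
# GUI environment. All interfaces here are hypothetical stand-ins.

def collect_trajectory(env, agent, task: str, max_steps: int = 20) -> dict:
    """Roll an agent through one task, logging every step, including
    mistakes and dead ends, which become useful training signal."""
    steps = []
    obs = env.reset(task)                # initial screenshot / UI state
    for _ in range(max_steps):
        action = agent.act(obs, task)    # e.g., tap, type, scroll
        next_obs, done = env.step(action)
        steps.append({"observation": obs, "action": action})
        obs = next_obs
        if done:
            break
    return {"task": task, "steps": steps, "success": env.task_succeeded()}
```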

Notably, while Ferret-UI and Ferret-UI 2 were evaluated on iPhone screenshots and other Apple interfaces, Ferret-UI Lite was trained and evaluated in Android, web, and desktop GUI settings, using benchmarks such as AndroidWorld and OSWorld. This choice likely reflects the availability of reproducible, large-scale GUI-agent benchmarks on those platforms.

### Performance and Applications

The researchers found that while Ferret-UI Lite excelled at short-horizon, low-complexity tasks, it performed less robustly on intricate, multi-step interactions, an expected trade-off given the constraints of a compact, on-device model. Even so, Ferret-UI Lite offers a local, private agent that can autonomously interact with app interfaces on the user's behalf, improving the experience without relying on cloud processing.

To learn more about the study, including detailed benchmark analyses and results, follow this link.
