In a recent study, Apple researchers unveiled a method for improving open-source models' ability to generate user interface (UI) code with SwiftUI. The work, presented in the paper "UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback," highlights how much difficulty large language models (LLMs) have producing UI code that is both syntactically correct and visually well designed.
The researchers pinpointed a core problem: even in curated datasets, UI code is scarce, often making up less than one percent of all code samples. To work around this, they took StarChat-Beta, an open-source LLM specialized for coding, supplied it with a list of UI descriptions, and instructed it to generate a large synthetic dataset of SwiftUI programs from those descriptions.
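To make the generation step concrete, here is a minimal sketch of how one might sample candidate SwiftUI programs from StarChat-Beta with the Hugging Face `transformers` library. The prompt template and sampling parameters are assumptions for illustration, not the paper's exact setup:

```python
# Illustrative sketch: sample SwiftUI candidates from StarChat-Beta.
# Prompt format and sampling settings are assumptions, not the paper's.
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceH4/starchat-beta")

# StarChat-style chat template (assumed here for illustration).
PROMPT_TEMPLATE = (
    "<|system|>\nYou write SwiftUI code.<|end|>\n"
    "<|user|>\nWrite a SwiftUI view for: {description}<|end|>\n"
    "<|assistant|>"
)

def generate_swiftui(description: str, n_samples: int = 4) -> list[str]:
    """Sample several candidate SwiftUI programs for one UI description."""
    outputs = generator(
        PROMPT_TEMPLATE.format(description=description),
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        num_return_sequences=n_samples,
        return_full_text=False,
    )
    return [o["generated_text"] for o in outputs]

candidates = generate_swiftui("a login screen with username and password fields")
```

Sampling several candidates per description is what makes the later filtering step worthwhile: weak outputs can be discarded while the best survivors become training data.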
Every generated program then went through an automated validation pipeline: it was compiled with the Swift compiler to confirm that it built successfully, and the resulting interface was evaluated by GPT-4V, a vision-language model, which compared it against the original description. Outputs that failed to compile, appeared irrelevant to the description, or duplicated earlier samples were discarded, leaving a high-quality training set for finetuning the model.
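The sketch below shows what this compile-and-filter stage might look like, assuming a machine with the Swift toolchain on `PATH` (and the Apple SDKs available for `import SwiftUI` to type-check). The vision-language relevance check is stubbed out, since the paper's exact scoring prompt is not reproduced here:

```python
# Sketch of the compile-and-filter stage. Assumes `swiftc` is installed;
# the GPT-4V relevance check is a placeholder stub.
import hashlib
import subprocess
import tempfile
from pathlib import Path

def compiles(swift_source: str) -> bool:
    """Return True if the Swift compiler accepts the program."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "Candidate.swift"
        src.write_text(swift_source)
        # -typecheck verifies the code is valid without producing a binary.
        result = subprocess.run(
            ["swiftc", "-typecheck", str(src)],
            capture_output=True,
        )
        return result.returncode == 0

def looks_relevant(swift_source: str, description: str) -> bool:
    """Placeholder for the vision-language check: in the paper's pipeline,
    the compiled UI is rendered and GPT-4V judges whether the screenshot
    matches the description. Always True in this sketch."""
    return True

def filter_candidates(candidates: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep programs that compile, match their description, and are unique."""
    seen: set[str] = set()
    kept = []
    for description, source in candidates:
        digest = hashlib.sha256(source.encode()).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        if compiles(source) and looks_relevant(source, description):
            seen.add(digest)
            kept.append((description, source))
    return kept
```

Using the compiler as an oracle is the key design choice: it provides cheap, unambiguous feedback on correctness, leaving the expensive vision-model check to judge only the candidates that already build.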
This process was repeated over several rounds, each one improving the model's ability to generate SwiftUI code. After five iterations, the researchers had accumulated nearly one million SwiftUI programs and produced a model, called UICoder, that generated interfaces far more faithful to their prompts than the original model did.
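Putting the pieces together, the outer feedback loop might look like the following sketch, reusing the helpers above. The `finetune` function is a stand-in stub for a standard supervised finetuning pass, not the paper's training code; the five-round count matches the study:

```python
# High-level sketch of the iterative feedback loop: generate, filter,
# finetune, repeat. `finetune` is a hypothetical stub, not the paper's code.
def finetune(model, dataset):
    """Stub for a supervised finetuning pass over (description, code)
    pairs; a real run would use a standard SFT setup."""
    return model

def feedback_loop(descriptions: list[str], model, rounds: int = 5):
    dataset: list[tuple[str, str]] = []
    for _ in range(rounds):
        # Sample candidates with the current model, then keep the survivors.
        candidates = [
            (desc, source)
            for desc in descriptions
            for source in generate_swiftui(desc)
        ]
        dataset.extend(filter_candidates(candidates))
        # Train on the accumulated high-quality set before the next round.
        model = finetune(model, dataset)
    return model, dataset
```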
Evaluations showed that UICoder substantially outperformed the base StarChat-Beta model on both automated metrics and human preference ratings. Notably, UICoder approached GPT-4's overall quality while surpassing it in compilation success rate.
A striking finding of the study was that the dataset used to train StarChat-Beta had unintentionally excluded Swift code almost entirely. The training data drew mainly on three sources: TheStack, a large dataset of permissively licensed code repositories; crawled web pages; and OpenAssistant-Guanaco, a smaller instruction-tuning dataset. The researchers determined that Swift repositories had been inadvertently omitted from TheStack, and that OpenAssistant-Guanaco contained only a single Swift example among thousands.
This oversight means UICoder's gains cannot be attributed to pre-existing SwiftUI examples; they came almost entirely from the self-generated, curated datasets produced by the automated feedback loop. The researchers suggested that the approach, demonstrated here on SwiftUI, could plausibly generalize to other programming languages and UI frameworks.
The complete study, “UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback,” is accessible on arXiv for further exploration.