**Generative Image Models: The Emergence of Normalizing Flows**
Today, generative image models largely fall into two camps: diffusion models, such as Stable Diffusion, and autoregressive models, like OpenAI’s GPT-4o. Recent research from Apple, however, points to renewed interest in a third approach: Normalizing Flows (NFs). Combined with Transformers, these models may have capabilities that were previously underappreciated.
### What are Normalizing Flows?
Normalizing Flows are models that learn an invertible transformation from real-world data, such as images, into simple noise, and then run that transformation in reverse to generate new samples. A key advantage of NFs is that they can compute the exact likelihood of the images they generate, something diffusion models cannot do directly. That property makes NFs especially useful for tasks where knowing the probability of an outcome matters. Despite this potential, early flow-based models tended to produce blurry images that lacked the detail and diversity of diffusion- and Transformer-based systems.
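For readers who want the underlying math: the exact-likelihood property comes from the standard change-of-variables formula for flows (this is general normalizing-flow theory, not something specific to Apple’s papers). If an invertible map $f$ sends an image $x$ to noise $z = f(x)$ drawn from a simple base distribution $p_Z$, the model density is

$$
\log p_X(x) = \log p_Z\bigl(f(x)\bigr) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|,
$$

and both terms can be evaluated exactly for a well-designed flow.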
### Study #1: TarFlow
In the paper “Normalizing Flows are Capable Generative Models,” Apple introduces TarFlow, short for Transformer AutoRegressive Flow. TarFlow replaces the handcrafted layers of conventional flow models with Transformer blocks, generating an image in small patches and predicting each patch from the ones that came before it. This autoregressive approach parallels the method OpenAI uses for image generation.
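To make the idea concrete, here is a minimal sketch of a patch-wise autoregressive affine flow driven by a Transformer, in the spirit of what the paper describes. The class name, patch size, model width, and single-stack design are illustrative assumptions, not Apple’s implementation.

```python
# Minimal sketch of a patch-wise autoregressive affine flow driven by a
# Transformer. Hyperparameters and architecture are illustrative only.
import torch
import torch.nn as nn

class PatchARFlow(nn.Module):
    def __init__(self, patch_dim=48, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        # The Transformer predicts an affine shift and log-scale for each patch
        # from the patches that came before it.
        self.to_shift = nn.Linear(d_model, patch_dim)
        self.to_logscale = nn.Linear(d_model, patch_dim)

    def _params(self, patches):
        # Shift the sequence by one and apply a causal mask so the parameters
        # for patch t depend only on patches 0..t-1.
        T = patches.size(1)
        ctx = torch.cat([torch.zeros_like(patches[:, :1]), patches[:, :-1]], dim=1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.transformer(self.embed(ctx), mask=causal)
        return self.to_shift(h), self.to_logscale(h)

    def forward(self, patches):
        # Normalizing direction: image patches -> Gaussian-like noise,
        # with an exact log-determinant for the likelihood.
        shift, logscale = self._params(patches)
        z = (patches - shift) * torch.exp(-logscale)
        log_det = -logscale.sum(dim=(1, 2))
        return z, log_det

    @torch.no_grad()
    def sample(self, noise):
        # Generative direction: invert the flow one patch at a time.
        x = torch.zeros_like(noise)
        for t in range(noise.size(1)):
            shift, logscale = self._params(x)
            x[:, t] = noise[:, t] * torch.exp(logscale[:, t]) + shift[:, t]
        return x

flow = PatchARFlow()
patches = torch.randn(2, 16, 48)     # 2 images, 16 patches of 48 values each
z, log_det = flow(patches)           # exact likelihood term comes from log_det
samples = flow.sample(torch.randn(2, 16, 48))
```

Training maximizes the exact likelihood while pushing the transformed patches toward Gaussian noise; sampling then draws noise and inverts the transformation patch by patch, which is where the autoregressive character shows up.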
A key difference is that while OpenAI generates discrete tokens, treating images much like sequences of text, TarFlow predicts pixel values directly. This sidesteps the quality loss that comes from compressing images into a fixed vocabulary of tokens. However, TarFlow struggled to scale to larger, high-resolution images.
### Study #2: STARFlow
Building on TarFlow, Apple introduces STARFlow (Scalable Transformer AutoRegressive Flow) in the paper “STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis.” STARFlow’s main innovation is that it operates in latent space rather than pixel space, letting it capture the broad structure of an image first and leave fine detail to a decoder.
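A rough sketch of that division of labor is below; the `Autoencoder` and `LatentFlow` classes and all sizes are placeholders chosen for illustration, not the architecture from the paper.

```python
# Sketch of the latent-space idea: a flow models a compact latent
# representation, and a separate decoder restores pixel-level detail.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Compress 3x256x256 images into a 16x32x32 latent grid (illustrative sizes).
        self.encoder = nn.Conv2d(3, 16, kernel_size=8, stride=8)
        self.decoder = nn.ConvTranspose2d(16, 3, kernel_size=8, stride=8)

class LatentFlow(nn.Module):
    # Stand-in for a TarFlow-style autoregressive flow acting on latents.
    def __init__(self, latent_dim):
        super().__init__()
        self.shift = nn.Parameter(torch.zeros(latent_dim))
        self.logscale = nn.Parameter(torch.zeros(latent_dim))

    def sample(self, noise):
        return noise * torch.exp(self.logscale) + self.shift

ae = Autoencoder()
flow = LatentFlow(latent_dim=16 * 32 * 32)

# Generation: sample coarse structure in latent space, then decode to pixels.
noise = torch.randn(1, 16 * 32 * 32)
latent = flow.sample(noise).view(1, 16, 32, 32)
image = ae.decoder(latent)
print(image.shape)                   # torch.Size([1, 3, 256, 256])
```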
STARFlow also improves how text prompts are handled by plugging in existing language models, such as Google’s Gemma, to interpret the user’s prompt. This setup keeps the image-generation component focused on visual detail while benefiting from advanced language understanding.
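One plausible way to wire this up is sketched below with a stand-in text encoder and a cross-attention block; the source only states that a pretrained language model such as Gemma interprets the prompt, so the specific conditioning mechanism here is an assumption.

```python
# Hedged sketch: condition image-patch states on frozen text features.
# TextEncoder is a stand-in for a pretrained language model; the cross-attention
# wiring is illustrative, not STARFlow's documented design.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    # Stand-in for a frozen pretrained language model (e.g. Gemma).
    def __init__(self, vocab=32000, d_text=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_text)

    def forward(self, token_ids):
        return self.embed(token_ids)            # (batch, prompt_len, d_text)

class ConditionedBlock(nn.Module):
    # One block of the image model, cross-attending to text features.
    def __init__(self, d_model=128, d_text=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                kdim=d_text, vdim=d_text,
                                                batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, patch_states, text_states):
        attended, _ = self.cross_attn(patch_states, text_states, text_states)
        return self.norm(patch_states + attended)

text_enc = TextEncoder()
block = ConditionedBlock()
prompt_ids = torch.randint(0, 32000, (1, 12))   # dummy tokenized prompt
patch_states = torch.randn(1, 16, 128)          # hidden states for 16 image patches
conditioned = block(patch_states, text_enc(prompt_ids))
print(conditioned.shape)                         # torch.Size([1, 16, 128])
```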
### Comparing STARFlow with OpenAI’s GPT-4o
While Apple explores Normalizing Flows, OpenAI has moved its GPT-4o model beyond diffusion. GPT-4o treats images as sequences of discrete tokens, predicting one token at a time, which gives it flexibility across text, images, and audio. The downside is that token-by-token generation can be slow and resource-intensive, especially for high-resolution images.
Apple’s approach with STARFlow, by contrast, emphasizes efficiency and portability, suggesting a focus on mobile applications. Both companies are moving beyond diffusion models, but their choices reflect different priorities: OpenAI is optimizing for cloud-scale capability, Apple for on-device performance.