Google’s Gemini Omni transforms images, audio, and text into video — and that’s just the beginning

Three years ago, Google set out to develop a multimodal large language model called Gemini, capable of processing text, image, audio, and video inputs to generate content across these formats. Today, during the Google I/O developer conference, CEO Sundar Pichai introduced Gemini Omni, a new suite of multimodal models poised to “create anything from any input.”

Omni initially focuses on video. Users can blend images, audio, video, and text, and Omni integrates these inputs into a coherent output. The resulting high-quality videos are informed by an understanding of physics, culture, history, and science. Omni also enables photo editing via plain-text commands, similar to Google’s Nano Banana.

Google’s existing video model, Veo, turns text and images into videos and allows customization through prompts. However, Nicole Brichtova of Google DeepMind emphasizes that the current release is more than a Veo update: it combines Gemini’s intelligence with the rendering capabilities of advanced media models.

During a media briefing, Koray Kavukcuoglu from DeepMind demonstrated Omni’s capabilities with a prompt like “a claymation explainer of protein folding,” which quickly rendered a stop-motion video with educational narration.

The broader vision for Omni involves generating content like images from audio and audio from video. Pichai highlighted that Gemini, the first native multimodal AI model, was trained across various data types, advancing from predictive text to reality simulation. Gemini Omni represents further progress in that direction.

As part of the release, users can create videos with digital avatars, a feature OpenAI previously popularized with its defunct Sora app. To prevent deepfakes, Google requires a specific onboarding process in which users record themselves; the resulting avatars are stored for future use. All Omni-generated videos will include Google’s SynthID digital watermark for verification.

Launching today, the Gemini Omni Flash model will be accessible through the Gemini app, YouTube Shorts, and AI creative studio Flow, rendering videos up to 10 seconds long. This duration is not a model limitation but a strategic choice to engage users, with plans for extended durations in the future.

Omni Flash is marketed as a consumer tool, offering features like personalized digital avatars for creating videos of personal achievements or vacations, as described by Brichtova and Gabe Barth-Maron from DeepMind. However, specific editing prompts are crucial to avoid over-editing errors, reminiscent of issues faced by Nano Banana users.

Although the immediate focus is on consumer applications, Omni’s potential in enterprise and creative sectors is evident, with Google planning API availability soon. The avatar-generation tool, already on Shorts, is expected to appeal to content creators, while a seamless multimodal workflow could revolutionize advertising and filmmaking.

In parallel, startup Luma AI is developing an agentic tool to generate ad campaigns from brief inputs, backed by its “unified” model. For her part, Brichtova expressed pride in Omni’s text-rendering capabilities, promising the accuracy that advertising requires. Google anticipates adoption by filmmakers and other creators.

For professional use, the more advanced Omni Pro model is expected to outperform Flash. No release date for Pro has been set, but Brichtova indicated it will become available once it represents a significant advancement over Flash.