Google takes next big step towards AGI, launches Gemini Omni: What is it, how it works and more
Google has launched Gemini Omni, a new family of artificial intelligence (AI) models designed to merge advanced text reasoning with multimedia creation. The models accept any combination of text, images, audio and video as an input prompt and use it to generate and edit high-quality video content, as Google works toward its ultimate goal of creating artificial general intelligence (AGI). The first step toward building a world model is Gemini Omni Flash, which is being integrated directly into the Gemini app, Google Flow and YouTube Shorts. Google confirmed that while the model family starts with video production, it will expand to support direct image and audio generation in the future.
How Gemini Omni works
Gemini Omni operates as a multimodal engine, meaning it processes different types of media simultaneously rather than converting them into text first. According to Google's announcement, the model offers three core capabilities:
Conversational video editing
Instead of relying on traditional timeline-based video editing software, users can modify video clips by typing or speaking instructions in natural language. The system is designed to remember the context of previous instructions across multiple turns of conversation and uses that context to produce output that maintains character consistency, preserves environmental details and tracks camera angles. Users can instruct the tool to alter specific objects, replace entire backgrounds, add new characters or completely transform the visual style of a video they have uploaded, similar to an example CEO Sundar Pichai shared in a post on X ahead of Google I/O 2026.
Integration of real-world knowledge and physics
Google stated that Gemini Omni goes beyond simple visual pattern matching by calculating the underlying physics of a scene. The model features an updated understanding of physical forces such as gravity, fluid dynamics and kinetic energy, which it uses to make generated motion look realistic.
Synthesis from multiple source materials
The model can synthesize a single video from multiple distinct source materials. For example, users can upload an image of a character, a text description of a setting and a video clip showing a specific art style, and Gemini Omni will blend these references into a unified video. Users can also generate videos featuring a digital version of themselves that looks like them and speaks in their own voice.
Addressing safety concerns about deepfakes and automated misinformation, Google noted that it is withholding broader speech and audio editing features for video from the public while it conducts further testing.
All video content generated by the Omni models will also automatically embed SynthID, a digital watermark developed by Google DeepMind. The watermark is invisible to the human eye but lets users check whether a video was generated by Google's AI through Google Search, Gemini in Chrome or the Gemini app.
Gemini Omni availability and rollout schedule
Google has established a tiered rollout schedule for Gemini Omni Flash starting this week:
Premium subscribers: Available immediately to all Google AI Plus, Pro, and Ultra subscribers globally via the Gemini app and Google Flow.
General public: Rolling out at no cost to standard users within YouTube Shorts and the YouTube Create app over the course of the week.
Developers and enterprise: Access will expand to corporate clients and external developers via application programming interfaces (APIs) in the coming weeks.