Google’s Gemini Omni makes realistic video from anything—photos, audio, or text
A hands-on test and official product details show that high-fidelity, personalized video generation now requires minimal effort and accepts nearly any input.
What matters
- Google unveiled Gemini Omni at I/O 2026, an “anything-to-anything” model that accepts image, text, video, or audio inputs.
- A Verge hands-on test produced realistic videos of a stuffed deer on vacation and a deepfaked person at the Eiffel Tower with minimal technical skill.
- The model supports conversational, step-by-step video editing in Google Flow, maintaining scene consistency across iterative prompts.
- Google DeepMind says Omni integrates real-world knowledge and physics understanding to improve coherence and storytelling.
- Safety guardrails, moderation policies, and commercial licensing terms for the tool have not been fully detailed.
What happened
At Google I/O 2026, Google unveiled Gemini Omni, an “anything-to-anything” generative model that can synthesize realistic video from nearly any input—images, text clips, existing video, or audio. According to Google DeepMind, the system is designed for conversational, step-by-step editing inside Google Flow, where each instruction builds on the last to preserve a coherent scene rather than generating disconnected clips. The company describes the workflow as “Nano Banana, but for video,” emphasizing that users can refine aesthetics, actions, and effects through natural language at any point.
The capabilities were put to the test in a hands-on review by The Verge’s Allison Johnson, who used the tool to create convincing footage of her child’s stuffed deer rafting and to place herself in a deepfaked scene at the Eiffel Tower. She reported that the process required “surprisingly little effort and know-how,” underscoring how far consumer-grade video generation has progressed in just a year. DeepMind notes that Omni combines an “intuitive understanding of physics” with Gemini’s broader knowledge base to improve narrative consistency and visual realism.
Why it matters
Omni’s multimodal design collapses the traditional pipeline for personalized video. Users no longer need specialized software, motion-capture rigs, or editing expertise to animate a physical object, alter an existing clip, or synthesize a scene from a voice memo and a few sentences. The model’s conversational interface allows iterative refinement, meaning a user can start with a photo, ask for a setting change, then adjust lighting or action in subsequent prompts—making high-fidelity synthetic media accessible to mainstream consumers.
That accessibility is precisely what worries observers. The Verge’s reviewer noted the experiment made her rethink the boundary between harmless creative play and problematic synthetic media, concluding that the tools are now “surprisingly good” at producing realistic output with minimal friction. As personalized deepfakes become trivial to generate from any reference material, platforms and policymakers face renewed pressure to keep pace with detection, disclosure, and consent standards.
Public reaction
No strong public signal was available from Reddit or broader social discussion at the time of publication.
What to watch
Google has yet to publish detailed safety guardrails, API pricing, or commercial licensing terms for Omni outputs. Key questions include how the company will moderate non-consensual depictions of real people, whether generated videos will carry persistent watermarks or content credentials, and how well the model’s physics and scene-consistency claims hold up in third-party stress tests across longer narratives. The speed at which these policies arrive may determine whether Omni is adopted as a creative utility or flagged as a misuse risk.
Sources
- Google’s new anything-to-anything AI model is wild | The Verge — Allison Johnson’s hands-on review
- Gemini Omni — Google DeepMind — Official capabilities, Google Flow access, and model features
Public reaction
No strong public signal was available from Reddit or broader social discussion at the time of publication.
Open questions
- How will Google moderate non-consensual or misleading personalized deepfakes created with Omni?
- What are the pricing, API access tiers, and commercial licensing terms for Omni outputs?
- Will Google require persistent watermarks or content credentials on generated videos?
What to do next
Developers
Begin testing Gemini Omni in Google Flow with multimodal inputs—images, audio, and text—to evaluate subject consistency and conversational editing reliability.
Early hands-on access across input types will reveal integration limits and API readiness before broader rollout.
Founders
Audit whether Omni’s native multimodal generation replaces any proprietary video-editing or synthetic-media pipelines in your product.
Incumbent platform tools that accept any input and edit conversationally can commoditize point-solution wrappers.
PMs
Draft disclosure and labeling requirements for AI-generated video in your platform now, ahead of mainstream Omni adoption.
As consumer deepfake tools become trivial to use, transparent content policies become competitive and regulatory necessities.
Investors
Stress-test portfolio companies in generative video for defensibility against free or bundled platform models like Gemini Omni.
Native distribution through Google Flow and Gemini integration can rapidly commoditize standalone video-generation startups.
Operators
Update employee acceptable-use policies to cover AI-generated video created with consumer tools such as Gemini Omni.
Operational clarity reduces brand and legal risk as realistic synthetic media enters everyday workflows.
How to test
- 1Open Google Flow and start a new Gemini Omni session.
- 2Upload a reference input, such as a portrait, object photo, voice clip, or text prompt.
- 3Enter a detailed scene description or editing instruction via the conversational interface.
- 4Generate the video and inspect the output for subject fidelity, temporal consistency, and physical plausibility.
- 5Iterate with follow-up prompts to test step-by-step scene editing and coherence across generations.
- 6Attempt prompts referencing real individuals or sensitive scenarios to evaluate safety guardrails and refusal rates.
Caveats
- Commercial licensing and usage rights for generated outputs remain unclear
- Long-form narrative consistency has not been widely third-party tested
- Safety guardrails and regional availability may vary at launch