Emu Video is a text-to-video generation model built on explicit image conditioning. It uses diffusion models to factorize generation into two steps: first generating an image from a text prompt, then generating a video conditioned on both the prompt and the generated image.
This factorized approach makes it possible to train high-quality video generation models efficiently. Unlike prior methods that require a deep cascade of models, Emu Video needs only two diffusion models to generate 512px, four-second videos at 16 fps.
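The two-step factorization described above can be sketched as follows. This is a minimal illustration of the pipeline structure only: the function names are hypothetical, and the stubbed samplers stand in for the two latent diffusion models Emu Video actually uses, whose internals are not reproduced here.

```python
import numpy as np

def text_to_image(prompt: str, size: int = 512) -> np.ndarray:
    """Step 1 (stub): sample an image conditioned on the text prompt."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.random((size, size, 3))

def image_to_video(prompt: str, image: np.ndarray,
                   seconds: int = 4, fps: int = 16) -> np.ndarray:
    """Step 2 (stub): sample frames conditioned on BOTH the prompt and
    the step-1 image. Here the frames simply start from that image."""
    num_frames = seconds * fps  # 4 s x 16 fps = 64 frames
    frames = np.repeat(image[None], num_frames, axis=0)
    drift = np.linspace(0.0, 0.1, num_frames)[:, None, None, None]
    return np.clip(frames + drift, 0.0, 1.0)  # placeholder "motion"

def generate(prompt: str) -> np.ndarray:
    image = text_to_image(prompt)          # step 1: text -> image
    return image_to_video(prompt, image)   # step 2: (text, image) -> video

video = generate("a corgi surfing a wave")
print(video.shape)  # 64 frames of 512x512 RGB
```

The key design point is that the second model always receives a strong visual anchor (the generated image), which is what lets two models replace a deep cascade.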
Compared with other models such as Make-A-Video (MAV), Imagen Video (IMAGEN), Align Your Latents (AYL), Reuse & Diffuse (R&D), CogVideo (COG), Gen2 (GEN2), and Pika Labs (PIKA), Emu Video delivers state-of-the-art performance in text-to-video generation.
In human evaluations, raters strongly preferred Emu Video's generations over those of competing models, both for visual quality and for faithfulness to the provided prompt.
Emu Video is authored by a team including Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra, with Rohit Girdhar and Mannat Singh contributing equally to the technical development.
The team also acknowledges the collaborators who contributed data and infrastructure support to the project.