OpenAI has introduced Sora, a new model that converts text prompts into videos up to a minute long while maintaining high visual quality and closely adhering to the user's prompt.
Sora is engineered to comprehend and simulate complex scenarios, including scenes with multiple characters, specific motions, and detailed backgrounds. It accurately interprets user prompts, ensuring consistency in characters and visual style throughout the video. A remarkable feature of Sora is its ability to animate still images and fill in or extend missing frames in videos, demonstrating its versatility and precision in handling visual data.
Building on the foundation laid by its predecessors, the DALL·E and GPT models, Sora incorporates the recaptioning technique from DALL·E 3. This approach involves generating highly descriptive captions for the visual training data, which improves the model's ability to follow the user's text instructions in the content it generates.
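To make the idea of recaptioning concrete, the sketch below pairs each raw training clip with a machine-generated descriptive caption before training. This is only an illustrative outline: the `describe_clip` captioner, the file layout, and the JSON output are assumptions for the example, not OpenAI's actual pipeline.

```python
from pathlib import Path
import json


def describe_clip(clip_path: Path) -> str:
    """Hypothetical captioner: in a real pipeline this would be a learned
    video-captioning model that produces a richly detailed description."""
    # Placeholder caption so the sketch runs end to end.
    return f"A detailed description of the scene in {clip_path.name}."


def build_recaptioned_dataset(clip_dir: Path, out_file: Path) -> None:
    """Pair every raw clip with a generated caption, mimicking the
    recaptioning step used for DALL·E 3-style training data."""
    records = []
    for clip_path in sorted(clip_dir.glob("*.mp4")):
        records.append({
            "clip": str(clip_path),
            "caption": describe_clip(clip_path),  # synthetic descriptive caption
        })
    out_file.write_text(json.dumps(records, indent=2))


if __name__ == "__main__":
    Path("raw_clips").mkdir(exist_ok=True)  # hypothetical directory of training clips
    build_recaptioned_dataset(Path("raw_clips"), Path("recaptioned_dataset.json"))
```

The point of the step is that richly detailed captions give the model a much tighter pairing between language and visual content than short alt-text-style labels would.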
Despite its impressive capabilities, OpenAI acknowledges Sora's limitations, including difficulty simulating the physics of complex scenes and occasional confusion over spatial details in prompts. To address potential risks, OpenAI is working with red teamers to assess and mitigate harms. The organization is also developing tools to detect when a video was generated by Sora and plans to include metadata in outputs for greater transparency.
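OpenAI has not published the metadata scheme it intends to use, but the general idea of provenance metadata can be sketched as follows. The field names, the sidecar-file approach, and the `write_provenance_sidecar` helper here are illustrative assumptions, not Sora's actual format.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_provenance_sidecar(video_path: Path, model_name: str = "text-to-video-model") -> Path:
    """Attach a simple provenance record to a generated video as a sidecar
    JSON file. The schema is illustrative, not an official standard."""
    digest = hashlib.sha256(video_path.read_bytes()).hexdigest()
    record = {
        "generator": model_name,                               # which model produced the file
        "created_utc": datetime.now(timezone.utc).isoformat(), # when it was produced
        "sha256": digest,                                       # lets viewers verify the file is unmodified
        "ai_generated": True,
    }
    sidecar = video_path.with_name(video_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar


if __name__ == "__main__":
    demo = Path("generated_clip.mp4")
    demo.write_bytes(b"placeholder video bytes")  # stand-in for real model output
    print(write_provenance_sidecar(demo, model_name="sora"))
```

In practice, provenance signals of this kind are most useful when paired with a detection classifier, since metadata can be stripped from a file after generation.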
Initially, Sora will be available to red teamers and select creative professionals, with OpenAI aiming to refine the model based on diverse user feedback.
The team behind Sora includes Tim Brooks and Bill Peebles, research scientists at OpenAI, and Aditya Ramesh, the creator of DALL·E, who leads OpenAI's video generation work. Their leadership and innovation have been crucial in developing Sora.
The release of Sora coincides with Google's announcement of Lumiere, a text-to-video diffusion model, and Gemini 1.5, a model that advances natural language processing with a greatly expanded context window.