NVIDIA has introduced its latest AI text-to-video model, VideoLDM. Developed in collaboration with researchers at Cornell University, the model generates videos at resolutions of up to 2048 × 1280 pixels, at 24 frames per second and up to 4.7 seconds long, from a text description.
The model is built on the Stable Diffusion neural network and has up to 4.1 billion parameters, of which only 2.7 billion are trained on video. Using an efficient Latent Diffusion Model (LDM) approach, the developers produce diverse, temporally consistent videos in high resolution and with very high quality.
The model also supports personalized video generation and convolutional-in-time synthesis. The temporal layers trained in VideoLDM are inserted into image LDM backbones that were fine-tuned in advance on the DreamBooth image set. This enables personalized text-to-video generation, and applying the learned temporal layers convolutionally over time produces slightly longer clips with little loss of quality.
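The core idea, frozen per-frame (spatial) layers from a pretrained image model interleaved with learned temporal layers that mix information across frames, can be sketched in a toy form. Everything here is a stand-in for illustration: the layer functions, the `alpha` gate, and the array shapes are assumptions, not NVIDIA's implementation.

```python
import numpy as np

def spatial_layer(x):
    # Stand-in for a frozen pretrained image-LDM layer: processes each
    # frame independently (a toy elementwise transform here).
    return np.tanh(x)

def temporal_layer(x, alpha=0.5):
    # Stand-in for a learned temporal layer: blends each frame with a
    # causal running average over the time axis so frames share
    # information. alpha is a hypothetical learned gate; alpha=0
    # recovers the pure image model.
    mixed = np.cumsum(x, axis=0) / np.arange(1, x.shape[0] + 1)[:, None, None]
    return (1 - alpha) * x + alpha * mixed

def video_block(x, alpha=0.5):
    # One interleaved block: frozen spatial layer, then temporal layer.
    return temporal_layer(spatial_layer(x), alpha=alpha)

# x: (frames, height, width) toy latent video
x = np.random.default_rng(0).standard_normal((8, 4, 4))
y = video_block(x)
print(y.shape)  # (8, 4, 4) – the temporal layer preserves the shape
```

Because the spatial layers stay frozen, the same temporal layers can be dropped into a differently fine-tuned image backbone (e.g. one personalized with DreamBooth), which is what makes the personalization transfer described above possible.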
VideoLDM can also generate videos of driving scenes at a resolution of 1024 × 512 pixels and up to 5 minutes long. The model can simulate specific driving scenarios, using bounding boxes to lay out the scene and generate believable video. It can also make multimodal motion predictions, generating multiple plausible continuations from a single initial frame.
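Multimodal prediction here means sampling several stochastic continuations conditioned on the same starting frame. A minimal sketch of that sampling pattern, with a trivial noise-driven predictor standing in for the actual conditional diffusion model (the `rollout` function and step size are assumptions):

```python
import numpy as np

def rollout(frame0, steps, rng):
    # Hypothetical stochastic predictor: each step perturbs the previous
    # frame with noise, standing in for a conditional diffusion sampler.
    frames = [frame0]
    for _ in range(steps):
        frames.append(frames[-1] + 0.1 * rng.standard_normal(frame0.shape))
    return np.stack(frames)

# Three rollouts from the same initial frame, each with its own RNG seed:
frame0 = np.zeros((4, 4))
rollouts = [rollout(frame0, steps=5, rng=np.random.default_rng(seed))
            for seed in range(3)]
# All rollouts share the first frame but diverge afterwards.
```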
The research paper will be presented at the Conference on Computer Vision and Pattern Recognition (CVPR) in Vancouver, June 18–22. Although the neural network is only a research project for now, it is an impressive demonstration of how far AI technology has advanced.
In just a month of internal testing, NVIDIA has achieved noticeable gains in how closely the generated video matches its text prompt. While it is unknown when the model will be released to the public, examples of VideoLDM's capabilities are available on the NVIDIA website.