Meta AI has announced Voicebox, a new generative AI model for speech. It is designed to help creators with a range of audio generation tasks, including audio editing, sampling, and styling. Earlier speech generation models needed task-specific training. Voicebox instead uses in-context learning to handle tasks it was not specifically trained for, which gives creators a broad set of functions from a single model.
Voicebox offers several features for the audio creation process. It can generate high-quality voices, edit recorded speech, and correct mistakes while preserving the content and style of the original audio. It can also remove unwanted sounds such as background noise or air-conditioning hum. In addition, it supports cross-lingual style transfer in six languages. Meta envisions future applications in which Voicebox gives natural-sounding voices to virtual assistants and to non-player characters in metaverse games.
Meta compared Voicebox with existing speech generation models, specifically VALL-E and YourTTS. Measuring word error rate (how intelligible the generated speech is) and style similarity (how closely it matches a reference voice), Meta reports that Voicebox outperforms both models.
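In text-to-speech evaluations, word error rate is usually obtained by transcribing the generated audio with a speech recognizer and comparing that transcript with the input text; lower is better. The standard calculation is a word-level edit distance divided by the number of reference words. Below is a minimal sketch of that formula; the function name is hypothetical, and this is not Meta's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            )
        prev = curr
    return prev[len(hyp)] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` returns 2/6 ≈ 0.333: one substitution ("sat" → "sit") and one deletion ("the").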
Voicebox is built on Flow Matching, Meta's recent non-autoregressive approach to generative modeling. Flow Matching can learn the highly non-deterministic mapping between text and speech, where the same sentence can be spoken in many valid ways. As a result, Voicebox can handle a wide range of voice generation tasks without relying on carefully labelled data, which allows it to be trained on large and varied datasets.
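To make the Flow Matching idea more concrete, here is a minimal sketch of the conditional flow-matching objective using the straight-line ("optimal transport") path from the Flow Matching formulation (Lipman et al.). A noise sample `x0` is blended toward a data sample `x1`, a model is trained to predict the velocity that moves one toward the other, and generation integrates that velocity starting from noise. This is an illustration under assumptions, not Meta's implementation: all function names are hypothetical, vectors are plain Python lists standing in for speech features, `sigma_min` is a small constant chosen for illustration, and the text and audio-context conditioning that Voicebox relies on is omitted.

```python
import random
from typing import Callable, List

Vector = List[float]
Model = Callable[[Vector, float], Vector]  # hypothetical: (x_t, t) -> predicted velocity


def ot_path(x0: Vector, x1: Vector, t: float, sigma_min: float = 1e-4) -> Vector:
    """Point at time t on the straight path from noise x0 (t=0) toward data x1 (t=1)."""
    return [(1 - (1 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector, sigma_min: float = 1e-4) -> Vector:
    """Time derivative of ot_path; constant along the path. This is the regression target."""
    return [b - (1 - sigma_min) * a for a, b in zip(x0, x1)]


def cfm_loss(predicted: Vector, target: Vector) -> float:
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def training_loss(x1: Vector, model: Model, rng: random.Random,
                  sigma_min: float = 1e-4) -> float:
    """Flow-matching loss for one data sample: random noise, random time, regress velocity."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                         # random time in [0, 1)
    xt = ot_path(x0, x1, t, sigma_min)
    return cfm_loss(model(xt, t), target_velocity(x0, x1, sigma_min))


def euler_sample(model: Model, x0: Vector, steps: int = 10) -> Vector:
    """Generate by integrating dx/dt = model(x, t) from t=0 (noise) to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = model(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because the target velocity is constant along each straight path, a model that predicted it exactly would carry `x0` to `sigma_min * x0 + x1`; a learned model only approximates this, so sampling uses several integration steps.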
So far, Voicebox has been trained on more than 50,000 hours of recorded speech and transcripts from public-domain audiobooks in six languages: English, French, Spanish, German, Polish, and Portuguese. It is trained to predict a segment of speech given the surrounding speech and the transcript of that segment.
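The training task described above, filling in a hidden stretch of speech from what surrounds it, can be illustrated with a small masking sketch. Everything here is hypothetical: frames are plain Python lists standing in for acoustic features, the span fractions are arbitrary parameters rather than Voicebox's actual settings, the transcript conditioning is omitted, and a simple squared-error loss stands in for the model's real training objective.

```python
import random
from typing import List, Tuple

Frame = List[float]  # one frame of acoustic features (illustrative stand-in)


def mask_random_span(frames: List[Frame], min_frac: float, max_frac: float,
                     rng: random.Random) -> Tuple[List[Frame], List[bool]]:
    """Hide one contiguous span of frames.

    Returns (context, mask): hidden frames are zeroed in context,
    and mask[i] is True where frame i must be predicted.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("need at least one frame")
    span = min(n, max(1, int(n * rng.uniform(min_frac, max_frac))))
    start = rng.randrange(0, n - span + 1)
    mask = [start <= i < start + span for i in range(n)]
    context = [[0.0] * len(f) if hidden else list(f) for f, hidden in zip(frames, mask)]
    return context, mask


def masked_mse(pred: List[Frame], target: List[Frame], mask: List[bool]) -> float:
    """Squared error averaged over hidden frames only; visible frames were given as context."""
    total, count = 0.0, 0
    for p, t, hidden in zip(pred, target, mask):
        if hidden:
            total += sum((a - b) ** 2 for a, b in zip(p, t))
            count += len(t)
    if count == 0:
        raise ValueError("mask hides no frames")
    return total / count
```

Scoring only the hidden frames means the model is rewarded for reconstructing the missing segment, not for copying context it was already given.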
While Voicebox opens new possibilities for generative AI in speech, Meta acknowledges the potential for misuse and unintended harm. As a safeguard, Meta has built a classifier that it says can effectively distinguish authentic speech from speech generated with Voicebox. The research paper accompanying Voicebox's release will provide more detail on how this classifier was developed.
Meta has decided not to release the Voicebox model or its source code to the public. However, interested users can explore a demo of Voicebox on Meta's dedicated page.