Microsoft has recently announced a new speech synthesis artificial intelligence model called “VALL-E,” which can imitate a person’s voice from a three-second audio sample. The name resembles OpenAI’s DALL-E, but VALL-E generates speech, while DALL-E generates images.
Once VALL-E has learned a specific voice, it can synthesize that person saying arbitrary text and even reproduce the speaker’s emotional tone. This technology could be used to create deepfakes. However, Microsoft has not released the VALL-E code, so its capabilities cannot be independently tested.
Microsoft has referred to VALL-E as a “neural codec language model,” based on a technology called EnCodec that Meta announced in October 2022. Unlike other speech synthesis methods that manipulate waveforms to synthesize speech, VALL-E generates discrete audio codec codes from text and audio prompts. It analyzes a human voice, using EnCodec to break the information into discrete elements called “tokens.” It then uses its training data to predict how that voice would sound when speaking a phrase other than the three-second sample.
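To make the idea of audio “tokens” more concrete, here is a minimal Python sketch of vector quantization, the general technique of replacing short chunks of a signal with the index of the closest entry in a codebook. This is not EnCodec or VALL-E: EnCodec uses a learned neural encoder with residual vector quantization, whereas the codebook, frame size, and function names below are hypothetical and hand-made purely for illustration.

```python
# Toy illustration only: a hand-made codebook, NOT EnCodec's learned one.
# Each entry is a 4-sample pattern; its position in the list is the "token".
TOY_CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],      # token 0: silence
    [0.5, 0.5, 0.5, 0.5],      # token 1: positive plateau
    [-0.5, -0.5, -0.5, -0.5],  # token 2: negative plateau
    [0.0, 0.7, 1.0, 0.7],      # token 3: rising bump
]


def frame_signal(samples, frame_size):
    """Split samples into non-overlapping frames, dropping any remainder."""
    last_start = len(samples) - frame_size
    return [samples[i:i + frame_size] for i in range(0, last_start + 1, frame_size)]


def nearest_code(frame, codebook):
    """Return the index of the codebook entry closest to the frame
    (smallest squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(frame, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize(samples, codebook):
    """Turn a list of audio samples into a list of discrete token ids."""
    frame_size = len(codebook[0])
    return [nearest_code(frame, codebook) for frame in frame_signal(samples, frame_size)]


def detokenize(tokens, codebook):
    """Rebuild an approximate signal by concatenating codebook entries."""
    samples = []
    for token in tokens:
        samples.extend(codebook[token])
    return samples


if __name__ == "__main__":
    audio = [0.0, 0.0, 0.0, 0.0, 0.4, 0.6, 0.5, 0.5, -0.6, -0.4, -0.5, -0.5]
    tokens = tokenize(audio, TOY_CODEBOOK)
    print(tokens)
    print(detokenize(tokens, TOY_CODEBOOK))
```

A language model such as VALL-E works on sequences of token ids like these rather than on raw waveforms, and a decoder turns the predicted ids back into audio. The reconstruction in this toy version is only approximate, which is also true, to a much smaller degree, of real neural codecs.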
VALL-E’s speech synthesis function was trained on LibriLight, a speech library assembled by Meta that contains 60,000 hours of English audio from over 7,000 speakers, drawn mostly from LibriVox public domain audiobooks. The more data that VALL-E has, the better it will perform. Microsoft has provided numerous voice samples of the AI model on its VALL-E sample site, and some of the results sound like human voices.
In addition to reproducing the timbre and emotional expressions of a speaker’s voice, VALL-E can also mimic the “acoustic environment” of a sample voice. For example, if the sampled voice is from a phone call, the synthesized output will simulate the acoustic and frequency characteristics of the phone, so that it sounds as if the speaker were talking through a receiver. VALL-E can also produce different timbre variations from the same input by changing the random seed used in the generation process.
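The role of the random seed can be illustrated with a small Python sketch. It assumes, as a simplification, that generation means sampling each token at random from a probability distribution; the three-token distribution and the `sample_tokens` function are hypothetical and stand in for a real model’s output. The point is only that the same seed reproduces the same sequence, while a different seed usually gives a different one.

```python
import random

# Hypothetical, fixed next-token probabilities standing in for a model's output.
# A real model would produce a new distribution at every step.
TOY_NEXT_TOKEN_PROBS = {0: 0.5, 1: 0.3, 2: 0.2}


def sample_tokens(seed, length=50):
    """Sample a token sequence using a private generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, so runs are reproducible
    tokens = list(TOY_NEXT_TOKEN_PROBS)
    weights = list(TOY_NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=length)


if __name__ == "__main__":
    # Same seed: identical sequences.
    print(sample_tokens(seed=1) == sample_tokens(seed=1))
    # Different seeds: the sequences will generally differ.
    print(sample_tokens(seed=1)[:10])
    print(sample_tokens(seed=2)[:10])
```

In a speech model, each distinct token sequence would decode to a slightly different rendition of the same sentence, which is one plausible reading of how changing the seed yields different timbre variations.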
VALL-E can be used to create high-quality text-to-speech applications and voice recordings, and it can be combined with other generative AI models such as GPT-3. However, there is a risk of misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate this risk, Microsoft plans to build a detection model to determine whether an audio clip was synthesized with VALL-E and will continue to follow the Microsoft AI Principles as the model is developed further.
Despite its impressive capabilities, the potential for misuse of VALL-E has caused some concern. In the wrong hands, this technology could be used to create deepfakes or to impersonate someone’s voice maliciously. For example, an individual could use VALL-E to create a voice recording that appears to be from a celebrity, politician, or another public figure in order to spread disinformation or cause confusion.
Even with precautions such as a detection model, VALL-E and similar technologies will likely continue to raise ethical and legal questions as they become more widespread. As with any new technology, it will be critical to weigh the dangers and advantages and to develop suitable norms and laws to ensure that these tools are used appropriately.
Recently a group of researchers from MIT developed an AI model, Speech2Face, that can predict a person’s face just by listening to their voice.