Google has released details of a new universal speech AI model that can understand spoken language across more than 300 languages.
The Universal Speech Model (USM) was trained on 12 million hours of speech and 28 billion sentences of text using a “continuous self-supervised learning and fine-tuning” approach. The model can currently perform automatic speech recognition in 100 languages, achieving an average word error rate of less than 30% across 73 languages, a “milestone never achieved before.”
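Word error rate, the metric behind that milestone, counts the word-level insertions, deletions and substitutions needed to turn a model’s transcript into the reference transcript, divided by the number of reference words. A minimal sketch of the computation (an illustrative helper, not Google’s evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for word-level Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A WER below 30% means fewer than three such errors for every ten words spoken, averaged over those 73 languages.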
The biggest challenge in automatic speech recognition is that conventional supervised learning methods are time-consuming and do not scale. Google’s self-supervised learning and fine-tuning approach lets the model learn from unlabelled audio at each stage without relying on human-labelled transcripts. The learning process also makes it “effective in adapting to new languages and data,” according to the researchers.
The new model is intended to create captions for YouTube videos, bringing greater inclusion to “billions of people living in marginalized communities” worldwide, Google said. It is also a significant first step in the company’s mission, announced last November, to build an AI model that can handle 1,000 spoken languages.
One of the challenges of speech and translation models is the large amount of training data they require, which makes it difficult to develop tools for languages with few examples available online. Google’s USM handles such low-resource languages by pre-training the model’s encoder on a large unlabelled dataset and then “fine-tuning on a smaller labelled dataset.”
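The two-stage recipe can be illustrated with a toy numerical sketch: first learn an “encoder” from unlabelled data via a self-supervised masking objective, then fit a predictor on a much smaller labelled set using the pre-trained representation. Everything below is illustrative; USM’s actual encoder and training objectives are far more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1: self-supervised pre-training on unlabelled data ---
# Toy stand-in for speech features: 2,000 unlabelled 8-dim vectors.
X_unlab = rng.normal(size=(2000, 8))
# Self-supervised task: reconstruct each vector from a copy with
# roughly half its dimensions masked (zeroed out) - no labels needed.
mask = rng.integers(0, 2, size=X_unlab.shape)
X_masked = X_unlab * mask
# Closed-form least-squares "encoder" mapping masked input -> original.
W, *_ = np.linalg.lstsq(X_masked, X_unlab, rcond=None)

def encode(x):
    """Pre-trained representation learned without any labels."""
    return x @ W

# --- Stage 2: fine-tuning on a much smaller labelled set ---
X_lab = rng.normal(size=(20, 8))        # only 20 labelled examples
y = (X_lab[:, 0] > 0).astype(float)     # toy binary labels
w, *_ = np.linalg.lstsq(encode(X_lab), y, rcond=None)

preds = (encode(X_lab) @ w > 0.5).astype(float)
accuracy = (preds == y).mean()
```

The point of the design is the data asymmetry: the expensive, label-free stage consumes thousands of examples, while the supervised stage needs only a handful, which is exactly what makes the approach viable for languages with little transcribed audio.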