Alongside the ChatGPT API, OpenAI has released the Whisper API, a hosted version of its open-source Whisper speech-to-text model, which enables robust transcription in multiple languages and translation from those languages into English. The model was trained on 680,000 hours of multilingual, multitask supervised data collected from the web, giving it improved recognition of distinctive accents, background noise, and technical jargon.
The Whisper API accepts files in M4A, MP3, MP4, MPEG, MPGA, WAV, and WEBM formats and costs $0.006 per minute of audio. The system does have limitations, which stem largely from its reliance on predicting the "next word": because it was trained on a large amount of noisy data, it can hallucinate words that were never actually spoken. It also does not perform equally well across languages, showing higher error rates for speakers of languages that are underrepresented in the training data.
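As a rough illustration of the format and pricing details above, here is a minimal sketch of a cost estimator. The helper function name and structure are hypothetical (not part of the OpenAI SDK); only the format list and the $0.006-per-minute price come from the announcement.

```python
# Hypothetical helper: validate a file's extension against the formats
# the Whisper API accepts, then estimate the transcription cost.

# Formats the Whisper API accepts, per OpenAI's announcement.
SUPPORTED_FORMATS = {"m4a", "mp3", "mp4", "mpeg", "mpga", "wav", "webm"}
PRICE_PER_MINUTE_USD = 0.006  # published Whisper API price

def estimate_transcription_cost(filename: str, duration_seconds: float) -> float:
    """Return the estimated API cost in USD for one audio clip."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported audio format: {ext}")
    minutes = duration_seconds / 60
    return round(minutes * PRICE_PER_MINUTE_USD, 6)

# A 10-minute MP3 costs 10 * $0.006 = $0.06.
print(estimate_transcription_cost("interview.mp3", 600))  # 0.06
```

In a real application the duration would come from the audio file's metadata; here it is passed in directly to keep the sketch self-contained.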
Despite these limitations, OpenAI believes that Whisper’s transcription capabilities can be used to improve existing apps, services, products, and tools. For instance, the AI-powered language learning app Speak already uses the Whisper API to power a new virtual conversational companion within the app.
This Whisper API release aligns with OpenAI’s goal to democratise AI and make it accessible to more people. By putting the technology in developers’ hands, OpenAI enables them to create new apps and services that help people across a wide range of industries and sectors.
The Whisper API is a significant step toward making AI-powered speech recognition and transcription available to developers worldwide. While it has limitations, it can improve existing apps and services and pave the way for further AI developments.