Meta Showcases New ‘Voicebox’ Text-to-Speech Generation Tool

On the surface at least, Meta’s latest AI advancement doesn’t seem like a major step.

Today, Meta has published an overview of its new ‘Voicebox’ AI system, which will enable users to convert text to audio in a range of styles and voices.

Introducing Voicebox, a new breakthrough generative speech system based on Flow Matching, a new method proposed by Meta AI. It can synthesize speech across six languages, perform noise removal, edit content, transfer audio style & more.

More details on this work & examples ⬇️

— Meta AI (@MetaAI) June 16, 2023

As presented in this overview clip, the Voicebox system can take text inputs and convert them into audio, with different voice options, enabling more advanced text-to-speech generation with lower learning and processing requirements than other, similar offerings.

Though, on the surface at least, it’s not a heap different from the text-to-audio tools that we’re now accustomed to – whether we like them or not – on TikTok and other apps.

The Voicebox translations sound pretty similar – and I’m willing to bet Meta won’t let me use the voice of Rocket Raccoon or a Transformer in these new translations.

But the Voicebox system is also more than just a direct text-to-speech tool.

As explained by Meta:

“Voicebox can produce high quality audio clips and edit pre-recorded audio – like removing car horns or a dog barking – all while preserving the content and style of the audio. The model is also multilingual and can produce speech in six languages. In the future, multipurpose generative AI models like Voicebox could give natural-sounding voices to virtual assistants and non-player-characters in the metaverse. They could allow visually impaired people to hear written messages from friends read by AI in their voices, give creators new tools to easily create and edit audio tracks for videos, and much more.”

As Meta notes, Voicebox can also work from a voice sample, so you can use an audio clip of another person to make your text-to-speech output sound like that person is speaking, based on just seconds of audio input.

Which will undoubtedly lead to a new raft of deepfakes – though again, similar tools do already exist. They just don’t work the same way and, according to Meta, aren’t as good as this new process.

The real benefit of Voicebox, in a broad-reaching sense, will be in translation, enabling simplified, native-sounding variations of your text inputs in different languages. That could open up new, cross-market opportunities, while the advanced modeling of the system will also facilitate broader use cases and processes, which could provide other key benefits.

But Meta is also aware of the risks.

At this stage, Meta isn’t releasing the source code or app to the public, citing ‘the potential risks of misuse’. It’s hoping to find more practical, valuable use cases for the technology over time – so its announcement today is more of an FYI than a launch, as such.

You can read more about Meta’s Voicebox project here.