Meta on Tuesday released a multimodal AI foundation model called SeamlessM4T that's designed for translating and transcribing speech and text.

The machine-learning model accepts either speech or text as input and returns either as output: it can translate speech to speech, speech to text, text to speech, and text to text, and it can transcribe speech via automatic speech recognition. Being able to handle all of these modes in a single model is what makes it truly multimodal.
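
Each mode is selected by a task string. Below is a minimal sketch of driving the model, assuming the Translator interface published in Meta's seamless_communication repository at release; the model and vocoder names, task strings, and predict() signature follow that repo's examples and may have changed since, and input.wav is a placeholder file:

    import torch
    from seamless_communication.models.inference import Translator

    # Load the 2.3B-parameter multitask model plus a vocoder for speech output
    translator = Translator(
        "seamlessM4T_large",
        vocoder_name_or_card="vocoder_36langs",
        device=torch.device("cuda:0"),
    )

    # Speech-to-text translation: take a WAV file, return Spanish text
    text, _, _ = translator.predict("input.wav", "s2tt", "spa")

    # Text-to-text translation: English to French
    text, _, _ = translator.predict("Hello, world", "t2tt", "fra", src_lang="eng")

    # Speech-to-speech translation: also returns a waveform and its sample rate
    text, wav, sr = translator.predict("input.wav", "s2st", "vie")

    # Plain transcription (automatic speech recognition)
    text, _, _ = translator.predict("input.wav", "asr", "eng")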

“Building a universal language translator, like the fictional Babel Fish in The Hitchhiker’s Guide to the Galaxy, is challenging because existing speech-to-speech and speech-to-text systems only cover a small fraction of the world’s languages,” the company said in a blog post. “But we believe the work we’re announcing today is a significant step forward in this journey.”

(In-ear fish hopefully not included.)

According to the social ad biz, SeamlessM4T follows from prior work like the corporation’s text-to-text translation model No Language Left Behind (NLLB), its Massively Multilingual Speech models, and its Universal Speech Translator for Hokkien, a language spoken in China and Southeast Asia.

Google, as it mentioned at its recent I/O developer conference, is working on its own Universal Translator project for automated video dubbing that's synchronized with lip movements.

Meta claims that using a single model, rather than a cascade of separate recognition, translation, and synthesis systems, reduces errors and delays, making the translation process better and more efficient. However, it's been suggested the Spanish-to-Vietnamese translation shown in the video narrated by Meta research scientist manager Paco Guzmán contains a typo and a mispronounced word.

So perhaps there’s room for further refinement.

SeamlessM4T, Meta claims, can handle 101 languages for speech input, 96 languages for text input and output, and 35 languages for speech output.

A paper on SeamlessM4T by more than 60 Meta researchers claims that, on speech-to-text tasks, the system copes with background noise and speaker variation 38 percent and 49 percent better, respectively, than the current state-of-the-art model (spoiler: it's OpenAI's Whisper).

Also, Meta’s model is less prone to offer translations that introduce inappropriate assumptions or terms not present in the original text.

“Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety,” the paper says. “Compared to the state-of-the-art, we report up to 63 percent of reduction in added toxicity in our translation outputs.”

Meta's model comes in various sizes in terms of parameter count, a rough indicator of a model's comprehensiveness and utility: SeamlessM4T-LARGE (2.3 billion), SeamlessM4T-MEDIUM (1.2 billion), and (soon) SeamlessM4T-SMALL (281 million).

As a point of comparison, OpenAI’s automatic speech recognition model Whisper (large) has 1.55 billion parameters while the smallest version (tiny) has 39 million, it’s claimed.
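
Those Whisper checkpoints are selected by name in OpenAI's open source whisper Python package, whose load_model() and transcribe() calls are its documented API; a quick sketch, with the audio filename as a placeholder:

    # pip install openai-whisper
    import whisper

    # Load the 39M-parameter "tiny" checkpoint; "large" pulls the 1.55B one
    model = whisper.load_model("tiny")

    # Transcribe in the source language...
    result = model.transcribe("clip.mp3")
    print(result["text"])

    # ...or translate the speech into English instead
    result = model.transcribe("clip.mp3", task="translate")
    print(result["text"])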

Meta’s penchant for releasing models under open source licenses, or more restrictive but not entirely proprietary terms, purportedly prompted a Googler earlier this year to pen a memo warning that open source AI would out-compete Google and Microsoft-allied OpenAI.

“Open-source models are faster, more customizable, more private, and pound-for-pound more capable,” the leaked memo stated. “They are doing things with $100 and 13B params that we struggle with at $10M and 540B. And they are doing so in weeks, not months.”

Indeed, if anyone were planning to profit from explicit AI imagery, the public release of open source text-to-image models – cited as a catalyst for the proliferation of non-consensual AI pornography – has commoditized that market. And with recent copyright rulings, so much more is up for grabs.

But Google’s concern may give Meta too much credit as the social ad biz has been less than open in its licensing of late. Just as Meta’s LLaMA 2 license is not open source, SeamlessM4T’s license imposes limitations that make it less useful outside of academia.

“In keeping with our approach to open science, we’re publicly releasing SeamlessM4T under CC BY-NC 4.0 to allow researchers and developers to build on this work,” the company explained.

The CC BY-NC 4.0 license forbids commercial use, so developers looking to add automated transcription, or translation into English, to an app may find OpenAI's Whisper model, available under an MIT license, more suitable. ®