People perceive speech both by listening to it and by watching the lip movements of speakers. In fact, studies show that visual cues play a key role in language learning. By contrast, AI speech recognition systems are built mostly, or entirely, on audio. And they require a substantial amount of data to train, typically thousands of hours of recordings.
Researchers at Meta (formerly Facebook) have developed Audio-Visual Hidden Unit BERT (AV-HuBERT), a framework that learns to understand speech both by watching people speak, especially footage of their mouth movements, and by listening to what they say, with the aim of improving the performance of speech recognition systems. Meta claims that AV-HuBERT is 75% more accurate than the best audiovisual speech recognition systems using the same amount of transcribed data. Moreover, the company says that AV-HuBERT outperforms the former best audiovisual speech recognition system while using one-tenth of the labeled data, making it potentially useful for languages with little audio data.
“In the future, AI frameworks like AV-HuBERT could be used to improve the performance of speech recognition technology in noisy everyday conditions, for example, interactions at a party or in a bustling street market,” Abdelrahman Mohamed, a Meta AI research scientist, said in an interview. “And assistants in smartphones, augmented reality glasses and smart speakers equipped with a camera, for example the Alexa Echo Show, can also benefit from this technology.”
Meta isn’t the first to apply AI to the problem of lip-reading. In 2016, researchers at the University of Oxford created a system that was nearly twice as accurate as experienced lip readers in certain tests and could process video in close to real time. And in 2017, Alphabet-owned DeepMind trained a system on thousands of hours of TV shows to correctly transcribe about 50% of words without errors on a test set, far better than a human expert’s 12.4%.
But the University of Oxford and DeepMind models, like many subsequent lip-reading models, were limited in the range of vocabulary they could recognize. The models also required datasets paired with transcripts in order to train, and they couldn’t process the audio of any speakers in the videos.
Somewhat uniquely, AV-HuBERT takes advantage of unsupervised, or self-supervised, learning. With supervised learning, algorithms like DeepMind’s are trained on labeled example data until they can detect the underlying relationships between the examples and particular outputs. For instance, a system might be trained to write the word “dog” (the output) when shown a picture of a Corgi (the example). AV-HuBERT, however, teaches itself to classify unlabeled data, processing the data to learn from its inherent structure.
AV-HuBERT is also multimodal in the sense that it learns to perceive language through a series of audio and lip-movement cues. By combining cues such as the movement of the lips and teeth during speech with auditory information, Meta says AV-HuBERT can capture “subtle connections” between the two data types.
The initial AV-HuBERT model was trained on 30 hours of labeled English-language TED Talk videos, substantially less than the 31,000 hours on which the previous state-of-the-art model was trained. But despite training on less data, AV-HuBERT’s word error rate (WER), a measure of speech recognition performance, was slightly better at 32.5% versus the old model’s 33.6% in cases where a speaker could be seen but not heard. (WER is calculated by dividing the number of incorrectly recognized words, counting substitutions, deletions and insertions, by the total number of words in the reference transcript; 32.5% translates to roughly one error in every three words.)
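To make the WER figures above concrete, here is a minimal Python sketch of how the metric is conventionally computed: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a system's output, divided by the number of reference words. The function name and the example sentences are illustrative assumptions, not Meta's evaluation code, and real evaluations typically normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); not taken from any AV-HuBERT code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown box jumps over lazy dog"
    # One substitution (fox -> box) and one deletion (the) over 9 reference words.
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Because insertions count as errors, WER computed this way can exceed 100% when the output is much longer than the reference.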
Once AV-HuBERT had learned the structure of, and correlations within, the data, the researchers were able to train it further on unlabeled data: 2,442 hours of English-language videos uploaded to YouTube. Not only did this bring the WER down to 26.9%, but Meta says it demonstrates that only a small amount of labeled data is needed to train the framework for a particular application (e.g., when multiple people are speaking at once) or for a different language.
Indeed, Meta claims that AV-HuBERT is about 50% better than audio-only models at recognizing a person’s speech while loud music or noise is playing in the background. And when the speech and background noise are similarly loud, AV-HuBERT manages a 3.2% WER, compared with 25.5% for the previous best multimodal model.
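One common way to compare two error rates is the relative reduction: the drop in WER expressed as a fraction of the baseline. The short Python sketch below applies that calculation to the 25.5% and 3.2% figures quoted above. It is an illustrative calculation only; the function name is hypothetical, and the article does not say that Meta's own percentage claims were computed this way.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (illustrative helper)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Figures quoted in the article for speech and noise at similar loudness.
    reduction = relative_wer_reduction(baseline_wer=25.5, new_wer=3.2)
    print(f"Relative WER reduction: {reduction:.1%}")
```

By this measure, going from 25.5% to 3.2% removes roughly 87% of the baseline's errors.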
In many ways, AV-HuBERT is emblematic of Meta’s growing investment in unsupervised, multimodal technology for complex tasks. The company recently detailed a new multimodal system designed to combat harmful content on its platforms, called Few-Shot Learner, and has released models that can learn to recognize speech, segment images, copy the style of text and recognize objects from unlabeled data. Compared with supervised systems, unsupervised systems can be significantly more flexible and cheaper to deploy; the labels in labeled datasets come from human annotators who have to painstakingly add each one.
Because it requires less labeled data for training, Meta says AV-HuBERT could open up possibilities for developing communication models for “low-resource” languages like Susu in the Niger-Congo family. AV-HuBERT could also be useful for creating speech recognition systems for people with speech impairments, the company suggests, as well as for detecting deepfakes and generating realistic lip movements for virtual reality avatars.
But Os Keys, an AI ethicist at the University of Washington, expressed concern that AV-HuBERT has limitations around class and disability. “If you’re trying to evaluate people’s speech patterns from ‘lip and tooth movements’, how does it work for people with distorted facial speech patterns as a result of disability?” they told VentureBeat via email. “It seems kind of ironic to manage to create software for speech recognition that is based on lip reading and is likely to have inaccuracies when … pointed at deaf people.”
In a paper proposing a research roadmap toward fairness in AI, researchers at Microsoft and Carnegie Mellon point out that facial analysis systems of this kind may not work well for people with conditions “that result in typical facial differences.” Such systems could also fail for people who have had a stroke, the researchers note, or who have Parkinson’s disease, Bell’s palsy, autism or Williams syndrome, and who may not use (or be able to use) the same facial expressions as neurotypical people.
In an email, Mohamed said that AV-HuBERT focuses only on the lip region to capture lip movements, not the entire face. Like most AI models, he added, AV-HuBERT’s performance will be “proportional to the number of representative samples of different populations in the training data.”
“To evaluate our approach, we used the publicly available LRS3 dataset, which consists of TED Talk videos that were made publicly available in 2018 by researchers at the University of Oxford. Since this dataset does not represent speakers with disabilities, we do not have a specific percentage for the expected performance degradation,” Mohamed said. “[But this] newly proposed technology is not limited by the current speaker distribution in the training dataset. We expect that different training datasets with coverage of broader and more diverse populations will bring significant performance gains.”
Meta says it will “continue to benchmark and develop approaches that improve the audio-visual speech recognition model in everyday situations where background noise and speaker overlap are common.” Beyond this, it plans to extend AV-HuBERT, which Meta does not plan to put into production, to multilingual benchmarks beyond English.