Can You Hear Me Now? Google AI To Clean Audio on YouTube Stories


The new enhancement uses Looking-to-Listen machine learning capabilities to clean audio

Google AI has announced a new audiovisual speech enhancement feature in YouTube Stories (iOS) that enables creators to make better selfie videos by automatically enhancing their voices and reducing noise. The new feature is based on Google’s Looking-to-Listen machine learning (ML) technology, which uses both visual and audio cues to isolate and separate the speech of a video subject from background sounds.

Two years ago, Google developed a machine learning technology that employs both visual and audio cues to isolate the speech of a video’s subject. By training the model on a large-scale collection of YouTube content, researchers at the company were able to capture correlations between speech and visual signals like mouth movements and facial expressions. These correlations can be used to separate one person’s speech in a video from another’s or to separate speech from loud background noises.
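To make the idea concrete, here is a minimal sketch of that kind of audio-visual separation, assuming a mask-based approach. It is not Google's actual model; every layer, size, and name below is illustrative. Per-frame face embeddings and a noisy magnitude spectrogram are fused to predict a time-frequency mask that keeps only the on-screen speaker's speech.

```python
# Illustrative sketch (not Google's architecture) of audio-visual speech separation:
# visual features from the speaker's face and the noisy spectrogram are fused
# to predict a mask that isolates that speaker's speech.
import torch
import torch.nn as nn

class AudioVisualSeparator(nn.Module):
    def __init__(self, visual_dim=512, freq_bins=257, hidden=256):
        super().__init__()
        # Encode per-frame face embeddings (e.g. from a face-thumbnail CNN).
        self.visual_rnn = nn.GRU(visual_dim, hidden, batch_first=True)
        # Encode the noisy magnitude spectrogram frame by frame.
        self.audio_rnn = nn.GRU(freq_bins, hidden, batch_first=True)
        # Fuse both streams and predict a per-time-frequency mask in [0, 1].
        self.mask_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, freq_bins), nn.Sigmoid(),
        )

    def forward(self, face_feats, noisy_spec):
        # face_feats: (batch, time, visual_dim); noisy_spec: (batch, time, freq_bins)
        v, _ = self.visual_rnn(face_feats)
        a, _ = self.audio_rnn(noisy_spec)
        mask = self.mask_head(torch.cat([v, a], dim=-1))
        return mask * noisy_spec  # enhanced spectrogram for the on-screen speaker

# Example with 100 aligned audio/video time steps.
model = AudioVisualSeparator()
faces = torch.randn(1, 100, 512)
spec = torch.rand(1, 100, 257)
enhanced = model(faces, spec)  # shape: (1, 100, 257)
```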

Announcing a new Speech Enhancement feature for YouTube Stories on iOS (based on the #LookingToListen speech isolation model) that allows creators to automatically enhance their voices and reduce background noise. Learn about the tech behind the feature at pic.twitter.com/EOY7ZRdwYk

— Google AI (@GoogleAI) October 1, 2020

According to Google software engineer Inbar Mosseri and Google Research scientist Michael Rubinstein, getting this technology into YouTube Stories was no easy feat. Over the past year, the Looking-to-Listen team worked with YouTube video creators to learn how they would like to use the feature, in what scenarios, and what balance of speech and background sound they would like their videos to retain. The Looking-to-Listen model also had to be streamlined to run efficiently on mobile devices; all processing is done on-device within the YouTube app to minimize processing time and preserve privacy. Finally, the technology had to be tested to ensure it performed consistently well across different recording conditions.

Looking-to-Listen works by first isolating thumbnail images that contain the speakers’ faces from the video stream. As the video is being recorded, a dedicated component extracts visual features learned for the purpose of speech enhancement from these face thumbnails. Once recording completes, the audio and the computed visual features are streamed to an audiovisual separation model that produces the isolated and enhanced speech.
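That two-stage flow might look roughly like the sketch below, where the face detection, feature extraction, and separation steps are hypothetical stand-ins for the on-device components: lightweight per-frame visual work happens during recording, and the heavier separation runs once afterward.

```python
# Rough sketch of the two-stage flow; all helpers are hypothetical stand-ins.
import numpy as np

def detect_face_thumbnail(frame):
    # Stand-in: a real face detector would crop the speaker's face.
    return frame[:128, :128]

def extract_visual_features(thumbnail):
    # Stand-in: a small on-device network would produce a feature vector here.
    return thumbnail.mean(axis=(0, 1))

def record_story(num_frames=30):
    visual_features, audio_chunks = [], []
    for _ in range(num_frames):
        frame = np.random.rand(720, 720, 3)         # stand-in for a camera frame
        audio_chunks.append(np.random.randn(1600))  # stand-in for ~33 ms of audio
        # Per-frame work done while recording: crop the face, compute features.
        visual_features.append(extract_visual_features(detect_face_thumbnail(frame)))
    return np.concatenate(audio_chunks), np.stack(visual_features)

def enhance_after_recording(audio, visual_features):
    # Stand-in for the audiovisual separation model that runs once recording
    # ends; on-device it would output the isolated, enhanced speech track.
    return audio  # placeholder: returns the audio unchanged

audio, feats = record_story()
enhanced_speech = enhance_after_recording(audio, feats)
```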

Mosseri and Rubinstein say that various architectural optimizations and improvements successfully reduced Looking-to-Listen’s running time from 10 times real-time on a desktop to 0.5 times real-time using only an iPhone processor. Moreover, they brought the system’s size down from 120MB to 6MB. The result is that enhanced speech is available within seconds after a YouTube Stories recording finishes.
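As a back-of-the-envelope illustration (the 15-second clip length is an assumption, not a figure from Google), those real-time factors translate into roughly the following processing times:

```python
# Illustrative arithmetic only: clip length is assumed, factors are from the article.
clip_seconds = 15

desktop_factor = 10.0  # 10x real-time: ~10 s of processing per 1 s of video
phone_factor = 0.5     # 0.5x real-time: ~0.5 s of processing per 1 s of video

print(f"Desktop (before): ~{clip_seconds * desktop_factor:.0f} s to process")
print(f"iPhone (after):   ~{clip_seconds * phone_factor:.1f} s to process")
# ~150 s versus ~7.5 s, which is why the enhanced audio is ready within
# seconds of the recording finishing.
```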

Looking-to-Listen does not remove all background noise; Google says the users it surveyed preferred to keep some background sound for ambiance. The company also claims the technology treats speakers of different appearances fairly. In a series of tests, the Looking-to-Listen team found the feature performed well across speakers of different ages, skin tones, spoken languages, voice pitches, degrees of visibility, head poses, facial hair, and accessories (like glasses).

YouTube creators eligible to create YouTube Stories can record a video on iOS and select “Enhance speech” from the volume controls editing tool, which immediately applies speech enhancement to the audio track and plays the enhanced speech back in a loop. They can then compare the original video with the enhanced version.