Microsoft AI Generates Talking Heads from Noisy Audio Samples
Researchers at Microsoft have developed a technique that aims to improve the accuracy and quality of talking head generation by focusing on the audio stream. Current talking head generation techniques require a clean, noise-free audio input with a neutral tone, but the researchers claim that their method "disentangles audio sequences" into factors like phonetic content, emotional tone, and background noise in order to work with any given audio sample.
"As we all know, oral communication is riddled with variations. Different people utter the same word in different contexts with varying duration, amplitude, tone and so on. In addition to linguistic (phonetic) content, speech carries abundant information about the speaker's emotional state, identity (gender, age, ethnicity) and personality, to name a few," wrote the researchers in a paper titled "Animating Face using Disentangled Audio Representations".
The researchers' proposed method works in two stages. First, a variational autoencoder (VAE) identifies the disentangled representations in the audio source. Once the disentanglement is done, a GAN-based video generator produces talking heads from the factorized audio input and an input face image.
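To make the two-stage pipeline concrete, here is a minimal sketch of the core idea: a VAE-style encoder that maps an audio feature frame into separate latent factors, and a stub generator driven only by the phonetic-content factor. The paper's actual models are not reproduced here; all layer sizes, factor names, and the `generate_talking_head_frame` stub are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class FactorizedVAEEncoder:
    """Toy encoder mapping one audio feature frame to three disentangled
    latent factors (content, emotion, noise), each modeled as a Gaussian
    with its own mean and log-variance, as in a standard VAE."""

    def __init__(self, in_dim, factor_dims):
        self.factor_dims = factor_dims  # e.g. {"content": 8, "emotion": 4, "noise": 2}
        # One random linear projection per factor for the mean and log-variance heads.
        self.weights = {
            name: (rng.standard_normal((in_dim, d)) * 0.1,   # mean head
                   rng.standard_normal((in_dim, d)) * 0.1)   # log-variance head
            for name, d in factor_dims.items()
        }

    def encode(self, x):
        """Return {factor_name: sampled latent} via the reparameterization trick."""
        latents = {}
        for name, (w_mu, w_lv) in self.weights.items():
            mu, log_var = x @ w_mu, x @ w_lv
            eps = rng.standard_normal(mu.shape)
            latents[name] = mu + np.exp(0.5 * log_var) * eps
        return latents

def generate_talking_head_frame(face_image, content_latent):
    """Stand-in for the GAN video generator: only the phonetic-content
    factor drives the output, so emotion and noise latents are ignored."""
    return face_image + content_latent.mean()  # placeholder "frame"

# One 40-dim audio feature frame (e.g. a slice of a mel-spectrogram).
audio_frame = rng.standard_normal(40)
enc = FactorizedVAEEncoder(40, {"content": 8, "emotion": 4, "noise": 2})
z = enc.encode(audio_frame)
frame = generate_talking_head_frame(np.zeros((4, 4)), z["content"])
print(sorted(z), frame.shape)
```

Because noise and emotion end up in their own latent factors, a downstream generator conditioned only on the content factor is, in principle, insulated from those variations, which is the intuition behind the paper's robustness claim.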
Microsoft researchers used three different datasets to train and test the VAE, namely GRID, CREMA-D, and LRS3. GRID is an audiovisual sentence corpus that contains 1,000 recordings from 34 people – 18 male, 16 female. CREMA-D is an audio dataset consisting of 7,442 clips from 91 ethnically diverse actors – 48 male, 43 female. LRS3 is a dataset with over 100,000 spoken sentences from TED videos.
Based on the test results, the researchers say that their method performs consistently across the entire emotional spectrum. "We validate our model by testing on noisy and emotional audio samples, and show that our approach significantly outperforms the current state-of-the-art in the presence of such audio variations."
The researchers also mention that their project could be expanded in the future to identify other speech factors, such as a person's identity and gender. So, what are your thoughts on this audio-driven head generation technique? Let us know in the comments.
Source: https://beebom.com/microsoft-ai-talking-heads-noisy-audio-samples/