End-To-End Generation of Talking Faces from Noisy Speech

Acoustic cues are not the only component of speech communication; when the visual counterpart is present, it has been shown to benefit speech comprehension. In this work, we propose an end-to-end system (no pre- or post-processing) that generates talking faces from arbitrarily long noisy speech. We introduce a mouth-region mask that encourages the network to focus on mouth movements rather than speech-irrelevant movements. In addition, we use generative adversarial network (GAN) training to improve image quality and mouth-speech synchronization. Furthermore, we employ noise-resilient training to make the network robust to unseen non-stationary noise. We evaluate our system with image-quality and mouth-shape (landmark) measures on noisy speech utterances containing five types of unseen non-stationary noise at signal-to-noise ratios (SNRs) from -10 dB to 30 dB in increments of 1 dB. Results show that our system significantly outperforms a state-of-the-art baseline, and that our noise-resilient training improves performance on noisy speech across a wide range of SNRs.
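The noise-resilient training and the evaluation protocol both rely on mixing noise into clean speech at a controlled SNR. As an illustration of the standard mixing procedure (the abstract does not specify the paper's exact implementation, so function and variable names here are illustrative), the noise signal is scaled so that the speech-to-noise power ratio matches the target SNR in dB:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix noise into speech at a target SNR in dB (illustrative sketch)."""
    # Tile or truncate the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Average power of each signal.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # epsilon avoids division by zero

    # Scale the noise so 10 * log10(p_speech / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: a single SNR drawn from the paper's evaluation range [-10, 30] dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for a clean utterance
noise = rng.standard_normal(8000)    # stand-in for non-stationary noise
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

Sweeping `snr_db` from -10 to 30 in steps of 1 reproduces the evaluation grid described above; for noise-resilient training, a random SNR per utterance is a common choice.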
