Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

Humans involuntarily tend to infer parts of a conversation from lip movements when speech is absent or corrupted by external noise. In this work, we explore the task of lip-to-speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip reading, we take a different path from existing works. We focus on learning accurate lip-sequence-to-speech mappings for individual speakers in unconstrained, large-vocabulary settings. To this end, we collect and release a large-scale benchmark dataset, the first of its kind, specifically designed to train and evaluate single-speaker lip-to-speech synthesis in natural settings. We propose a novel approach with key design choices that achieves accurate, natural lip-to-speech synthesis in such unconstrained scenarios for the first time. Extensive evaluation with quantitative metrics, qualitative analysis, and human studies shows that our method is four times more intelligible than previous works in this space.
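To make the task concrete, the sketch below shows one common shape for a lip-to-speech model: a 3D-CNN encoder that extracts per-frame features from a cropped lip-region video, followed by a recurrent decoder that predicts melspectrogram frames for a vocoder to convert to a waveform. This is a minimal, hypothetical PyTorch illustration of the general pipeline, assuming a fixed-size lip crop as input; the module name `LipToMelSketch`, the layer sizes, and all hyperparameters are placeholders and do not describe the architecture proposed in this paper.

```python
import torch
import torch.nn as nn

class LipToMelSketch(nn.Module):
    """Toy lip-to-speech model: 3D convolutions pool away the spatial
    dimensions of a lip-crop video, and a GRU decodes the resulting
    per-frame features into melspectrogram frames."""

    def __init__(self, n_mels=80, feat_dim=256):
        super().__init__()
        # 3D CNN over (batch, channels, time, height, width) lip crops
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time axis, pool space
        )
        self.rnn = nn.GRU(64, feat_dim, batch_first=True)
        self.to_mel = nn.Linear(feat_dim, n_mels)

    def forward(self, video):
        # video: (batch, 3, T, H, W) lip-region frames
        feats = self.encoder(video)            # (batch, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1)  # (batch, 64, T)
        feats = feats.transpose(1, 2)          # (batch, T, 64)
        hidden, _ = self.rnn(feats)            # (batch, T, feat_dim)
        return self.to_mel(hidden)             # (batch, T, n_mels)


if __name__ == "__main__":
    model = LipToMelSketch()
    clip = torch.randn(2, 3, 25, 48, 96)  # 2 clips, 25 frames, 48x96 crops
    mel = model(clip)
    print(mel.shape)  # torch.Size([2, 25, 80])
```

Models in this family are typically trained with an L1 or L2 reconstruction loss between predicted and ground-truth melspectrograms, and the predicted spectrogram is inverted to audio with Griffin-Lim or a neural vocoder.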
