Exemplar-based Lip-to-Speech Synthesis Using Convolutional Neural Networks

In this paper, we propose a neural-network-based lip-to-speech synthesis approach that converts “unvoiced” lip movements into “voiced” utterances. In our previous work, we proposed a lip-to-speech conversion method based on exemplar-based nonnegative matrix factorization (NMF). However, that approach has two problems. First, it requires unnatural preprocessing of the visual features to satisfy the nonnegativity constraint of NMF. Second, the activity matrix estimated from the visual features may not transfer to the audio features in an NMF-based approach. To tackle these problems, we use convolutional neural networks (CNNs) to convert visual features into audio features, and we integrate an exemplar-based approach into the networks to retain the advantage of our previous work. Experimental results show that the proposed method produces more natural speech than conventional methods.
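To make the contrast between the two approaches concrete, the sketch below illustrates (a) the exemplar-based NMF mapping from the earlier work, where an activity matrix estimated against a visual exemplar dictionary is reused with paired audio exemplars, and (b) a minimal CNN regressor that maps visual features directly to audio features. This is a minimal sketch, not the paper's actual system: all names, dimensions, and layer sizes (e.g., `VisualToAudioCNN`, `vis_dim`, the exemplar count) are illustrative assumptions.

```python
# Hypothetical sketch: exemplar-based NMF conversion vs. a CNN mapping.
# Dimensions and architecture are assumptions for illustration only.
import numpy as np
import torch
import torch.nn as nn

def nmf_activities(V, W_v, n_iter=200, eps=1e-9):
    """Estimate a nonnegative activity matrix H such that V ~= W_v @ H,
    keeping the visual exemplar dictionary W_v fixed (multiplicative updates)."""
    H = np.random.rand(W_v.shape[1], V.shape[1])
    for _ in range(n_iter):
        H *= (W_v.T @ V) / (W_v.T @ W_v @ H + eps)
    return H

def exemplar_nmf_convert(V, W_v, W_a):
    """Baseline: reuse the visual activity matrix with paired audio exemplars.
    This assumes the same H explains both modalities, which may not hold."""
    H = nmf_activities(V, W_v)
    return W_a @ H  # estimated audio features

class VisualToAudioCNN(nn.Module):
    """Illustrative CNN regressor from visual to audio feature sequences.
    Unlike NMF, no nonnegativity constraint is imposed on the inputs."""
    def __init__(self, vis_dim=64, aud_dim=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(vis_dim, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, aud_dim, kernel_size=1),
        )

    def forward(self, v):        # v: (batch, vis_dim, frames)
        return self.net(v)       # -> (batch, aud_dim, frames)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V = rng.random((64, 100))              # visual features (dim x frames)
    W_v = rng.random((64, 500))            # 500 paired visual exemplars
    W_a = rng.random((40, 500))            # 500 paired audio exemplars
    A_nmf = exemplar_nmf_convert(V, W_v, W_a)
    model = VisualToAudioCNN()
    A_cnn = model(torch.rand(1, 64, 100)) # -> (1, 40, 100)
    print(A_nmf.shape, tuple(A_cnn.shape))
```

Note how the NMF path requires nonnegative visual inputs and assumes the activity matrix transfers across modalities, while the CNN path removes both constraints, which is the motivation stated above.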
