Reconstructing intelligible audio speech from visual speech features

This work investigates the feasibility of producing intelligible audio speech from visual speech features alone. The proposed method estimates a spectral envelope from the visual features, which is then combined with an artificial excitation signal within a model of speech production to reconstruct an audio signal. Different combinations of audio and visual features are considered, along with both a statistical method of estimation and a deep neural network. The intelligibility of the reconstructed audio speech is measured by human listeners and compared with the intelligibility of the video signal alone and of the video combined with the reconstructed audio.
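To make the pipeline concrete, below is a minimal sketch of the reconstruction stage in Python. It assumes the estimator (standing in for the paper's statistical mapping or DNN) outputs a per-frame all-pole (LPC) spectral envelope, and it drives that envelope with a simple artificial excitation: an impulse train at a fixed F0 for voiced frames and white noise for unvoiced ones. The frame rate, LPC order, fixed pitch, and `model.predict` interface are illustrative assumptions, not the paper's exact configuration; the paper may well use a different envelope parameterisation and excitation model.

```python
# Minimal sketch: source-filter reconstruction of audio from visual
# features. All constants and the `model` interface are illustrative
# assumptions, not the paper's actual settings.
import numpy as np
from scipy.signal import lfilter

FS = 8000          # sample rate (Hz); assumed
FRAME_LEN = 160    # 20 ms frames at 8 kHz; assumed
LPC_ORDER = 12     # order of the estimated all-pole envelope; assumed
F0 = 120.0         # fixed pitch for the artificial excitation (Hz); assumed

def estimate_envelopes(visual_feats, model):
    """Map each visual feature vector (e.g. AAM parameters) to a set of
    LPC coefficients. `model` stands in for the trained estimator; here it
    is any object with a predict(frames) -> (n_frames, LPC_ORDER) method."""
    return model.predict(visual_feats)

def make_excitation(n_frames, voiced):
    """Artificial excitation: an impulse train for voiced frames and
    low-level white noise for unvoiced frames."""
    excitation = np.zeros(n_frames * FRAME_LEN)
    period = int(FS / F0)
    for i, v in enumerate(voiced):
        start, stop = i * FRAME_LEN, (i + 1) * FRAME_LEN
        if v:
            excitation[start:stop:period] = 1.0   # pulses every pitch period
        else:
            excitation[start:stop] = 0.1 * np.random.randn(FRAME_LEN)
    return excitation

def synthesise(lpc_frames, excitation):
    """Source-filter reconstruction: pass each frame of excitation through
    the all-pole filter 1/A(z) given by that frame's estimated envelope,
    carrying the filter state across frame boundaries to avoid clicks."""
    out = np.zeros_like(excitation)
    zi = np.zeros(LPC_ORDER)                      # filter state
    for i, a in enumerate(lpc_frames):
        start, stop = i * FRAME_LEN, (i + 1) * FRAME_LEN
        denom = np.concatenate(([1.0], a))        # A(z) = 1 + a1*z^-1 + ...
        out[start:stop], zi = lfilter([1.0], denom,
                                      excitation[start:stop], zi=zi)
    return out
```

In practice the regressor would likely output the envelope in some other domain (e.g. filterbank or cepstral features) and a conversion to a stable synthesis filter would precede this step; the sketch only illustrates how an estimated envelope and an artificial excitation combine in a speech production model.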
