Speech Reconstruction using Multi-view Silent Videos

Speechreading broadly involves looking at, perceiving, and interpreting spoken symbols. It has a wide range of multimedia applications, such as surveillance, Internet telephony, and aids for people with hearing impairments. However, most work in speechreading has been limited to generating text from silent videos. Recent research has ventured into generating (audio) speech from silent video sequences, but there have been no attempts to use multiple cameras for speech generation. To this end, this paper presents the first multi-view speechreading and reconstruction system. The work pushes the boundaries of multimedia research by putting forth a model that leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple views in building an efficient speechreading and reconstruction system. The paper further identifies the camera placement that yields the maximum intelligibility of the reconstructed speech. Finally, it lays out various innovative applications of the proposed system, highlighting its potential impact not only in the security arena but also in many other multimedia analytics problems.
