Lipreading with long short-term memory

Lipreading, i.e. speech recognition from visual-only recordings of a speaker's face, can be achieved with a processing pipeline based solely on neural networks, yielding significantly better accuracy than conventional methods. Feedforward and recurrent neural network layers (namely Long Short-Term Memory; LSTM) are stacked to form a single structure which is trained by back-propagating error gradients through all the layers. The performance of such a stacked network was experimentally evaluated and compared to a standard Support Vector Machine classifier using conventional computer vision features (Eigenlips and Histograms of Oriented Gradients). The evaluation was performed on data from 19 speakers of the publicly available GRID corpus. With 51 different words to classify, we report a best word accuracy on held-out evaluation speakers of 79.6% using the end-to-end neural network-based solution (11.6% improvement over the best feature-based solution evaluated).

[1]  H. McGurk,et al.  Hearing lips and seeing voices , 1976, Nature.

[2]  Eric David Petajan,et al.  Automatic Lipreading to Enhance Speech Recognition (Speech Reading) , 1984 .

[3]  Noboru Sugie,et al.  A Speech Prosthesis Employing a Speech Synthesizer-Vowel Discrimination from Perioral Muscle Activities and Vowel Production , 1985, IEEE Transactions on Biomedical Engineering.

[4]  M.S. Morse,et al.  Use of myoelectric signals to recognize speech , 1989, Images of the Twenty-First Century. Proceedings of the Annual International Engineering in Medicine and Biology Society,.

[5]  Lawrence D. Jackel,et al.  Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.

[6]  Hervé Bourlard,et al.  Connectionist speech recognition , 1993 .

[7]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[8]  Yochai Konig,et al.  "Eigenlips" for robust speech recognition , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[9]  Jenq-Neng Hwang,et al.  Lipreading from color video , 1997, IEEE Trans. Image Process..

[10]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[11]  Timothy F. Cootes,et al.  Active Appearance Models , 1998, ECCV.

[12]  Jürgen Schmidhuber,et al.  Learning to Forget: Continual Prediction with LSTM , 2000, Neural Computation.

[13]  Yoshua Bengio,et al.  Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies , 2001 .

[14]  Bruce Denby,et al.  Speech synthesis from real time ultrasound images of the tongue , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[15]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[16]  Jon Barker,et al.  An audio-visual corpus for speech perception and automatic speech recognition. , 2006, The Journal of the Acoustical Society of America.

[17]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[18]  Gérard Chollet,et al.  Continuous-speech phone recognition from ultrasound and optical images of the tongue and lips , 2007, INTERSPEECH.

[19]  Gérard Chollet,et al.  Eigentongue Feature Extraction for an Ultrasound-Based Silent Speech Interface , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[20]  J. M. Gilbert,et al.  Development of a (silent) speech recognition system for patients following laryngectomy. , 2008, Medical engineering & physics.

[21]  Matti Pietikäinen,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON MULTIMEDIA 1 Lipreading with Local Spatiotemporal Descriptors , 2022 .

[22]  Barry-John Theobald,et al.  Comparing visual features for lipreading , 2009, AVSP.

[23]  Gérard Chollet,et al.  Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips , 2010, Speech Commun..

[24]  J. M. Gilbert,et al.  Silent speech interfaces , 2010, Speech Commun..

[25]  Andrea Vedaldi,et al.  Vlfeat: an open and portable library of computer vision algorithms , 2010, ACM Multimedia.

[26]  Tanja Schultz,et al.  Modeling coarticulation in EMG-based continuous speech recognition , 2010, Speech Commun..

[27]  Naomi Harte,et al.  Viseme definitions comparison for visual-only speech recognition , 2011, 2011 19th European Signal Processing Conference.

[28]  Jürgen Schmidhuber,et al.  A committee of neural networks for traffic sign classification , 2011, The 2011 International Joint Conference on Neural Networks.

[29]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[30]  Barry-John Theobald,et al.  Is automated conversion of video to text a reality? , 2012, Optics/Photonics in Security and Defence.

[31]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[32]  Navdeep Jaitly,et al.  Hybrid speech recognition with Deep Bidirectional LSTM , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[33]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[34]  Barry-John Theobald,et al.  Recent developments in automated lip-reading , 2013, Optics/Photonics in Security and Defence.

[35]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[36]  Tetsuya Ogata,et al.  Lipreading using convolutional neural network , 2014, INTERSPEECH.

[37]  Erich Elsen,et al.  Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[38]  Carlos Busso,et al.  Lipreading approach for isolated digits recognition under whisper and neutral speech , 2014, INTERSPEECH.

[39]  Tanja Schultz,et al.  Tackling Speaking Mode Varieties in EMG-Based Speech Recognition , 2014, IEEE Transactions on Biomedical Engineering.

[40]  Jürgen Schmidhuber,et al.  LSTM: A Search Space Odyssey , 2015, IEEE Transactions on Neural Networks and Learning Systems.