Advances and Challenges in Deep Lip Reading

Driven by deep learning techniques and large-scale datasets, recent years have witnessed a paradigm shift in automatic lip reading. While the main thrust of Visual Speech Recognition (VSR) was to improve the accuracy of Audio Speech Recognition systems, other potential applications, such as biometric identification, together with the promised gains of VSR systems, have motivated extensive efforts to develop lip reading technology. This paper provides a comprehensive survey of state-of-the-art deep learning based VSR research, with a focus on data challenges, task-specific complications, and the corresponding solutions. Advancements in these directions will expedite the transformation of silent speech interfaces from theory to practice. We also discuss the main modules of a VSR pipeline and the influential datasets. Finally, we introduce some typical VSR application concerns and impediments to deployment in real-world scenarios, as well as future research directions.
