Paper information: Survey on automatic lip-reading in the era of deep learning

Abstract: In the last few years, there has been increasing interest in developing systems for Automatic Lip-Reading (ALR). As in other computer vision applications, methods based on Deep Learning (DL) have become very popular and have substantially advanced the achievable performance. In this survey, we review ALR research during the last decade, highlighting the progression from approaches that predate DL (which we refer to as traditional) toward end-to-end DL architectures. We provide a comprehensive list of the audio-visual databases available for lip-reading, describing the tasks they can be used for, their popularity and their most important characteristics, such as the number of speakers, vocabulary size, recording settings and total duration. In line with the shift toward DL, we show a clear tendency toward large-scale datasets that target realistic application settings and provide large numbers of samples per class. We also summarize, discuss and compare the ALR systems proposed in the last decade, considering traditional and DL approaches separately. We present a quantitative analysis of these systems, organizing them by the task they target (e.g. recognition of letters or digits, and of words or sentences) and comparing their reported performance on the most commonly used datasets. We find that DL architectures perform similarly to traditional ones on simpler tasks but report significant improvements on more complex tasks, such as word or sentence recognition, with up to 40% improvement in word recognition rates. We therefore provide a detailed description of the available ALR systems based on end-to-end DL architectures and identify a tendency to focus on the modeling of temporal context as the key to advancing the field. Such modeling is dominated by recurrent neural networks because of their ability to retain context at multiple scales (e.g. short- and long-term information). Accordingly, current efforts tend toward techniques that allow more comprehensive modeling and better interpretability of the retained context.

Federico Sukno | Adriana Fernandez-Lopez
