An Effective Conversion of Visemes to Words for High-Performance Automatic Lipreading

As an alternative approach, viseme-based lipreading systems have demonstrated promising results in decoding videos of people uttering entire sentences. However, the overall performance of such systems depends heavily on how effectively visemes are converted into words during the lipreading process. As the literature shows, this conversion has become a bottleneck: performance can drop dramatically from a high viseme classification accuracy (e.g., over 90%) to a much lower word classification accuracy (e.g., just over 60%). The underlying cause is that roughly half of the words in the English language are homophemes, i.e., a single sequence of visemes can map to multiple words, e.g., “time” and “some”. In this paper, to tackle this issue, a deep learning model with an attention-based Gated Recurrent Unit (GRU) is proposed for efficient viseme-to-word conversion and compared against three other approaches. The proposed approach offers strong robustness, high efficiency, and short execution time, and it has been verified through analysis and practical experiments on predicting sentences from the benchmark LRS2 and LRS3 datasets. The main contributions of the paper are as follows: (1) a model is developed that converts visemes to words effectively, discriminates between homopheme words, and is robust to incorrectly classified visemes; (2) the proposed model uses few parameters and therefore requires little overhead and time to train and execute; and (3) an improved performance in predicting spoken sentences from the LRS2 dataset, with an attained word accuracy of 79.6%, an improvement of 15.0% over state-of-the-art approaches.
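To make the viseme-to-word step concrete, the sketch below shows one plausible way to realise an attention-based GRU encoder–decoder that maps a sequence of viseme tokens to a sequence of word tokens. It is a minimal illustration in PyTorch, not the authors' implementation: the vocabulary sizes, layer widths, and the Luong-style dot-product attention are assumptions made for the example.

```python
# Minimal sketch of an attention-based GRU encoder-decoder for viseme-to-word
# conversion. NOT the paper's implementation: vocabulary sizes, layer widths,
# and the Luong-style dot-product attention are illustrative assumptions.
import torch
import torch.nn as nn

class VisemeToWordModel(nn.Module):
    def __init__(self, n_visemes=20, n_words=10000, emb=128, hidden=256):
        super().__init__()
        self.enc_emb = nn.Embedding(n_visemes, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.dec_emb = nn.Embedding(n_words, emb)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_words)

    def forward(self, viseme_ids, word_ids):
        # Encode the viseme sequence.
        enc_out, enc_h = self.encoder(self.enc_emb(viseme_ids))
        # Decode conditioned on the (teacher-forced) previous words.
        dec_out, _ = self.decoder(self.dec_emb(word_ids), enc_h)
        # Dot-product attention over the encoder states.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))    # (B, T_dec, T_enc)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights, enc_out)                   # (B, T_dec, hidden)
        # Predict each word from the decoder state plus its attention context.
        return self.out(torch.cat([dec_out, context], dim=-1))  # (B, T_dec, n_words)

# Toy usage: 2 viseme sequences of length 12, teacher-forced word length 6.
model = VisemeToWordModel()
visemes = torch.randint(0, 20, (2, 12))
words = torch.randint(0, 10000, (2, 6))
logits = model(visemes, words)
print(logits.shape)  # torch.Size([2, 6, 10000])
```

The attention step is what allows the decoder to weigh different parts of the viseme sequence when resolving ambiguous (homopheme) mappings; in practice the logits would be trained with a cross-entropy loss against the ground-truth word sequence.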
