Building Large-vocabulary Speaker-independent Lipreading Systems

Constructing a viable lipreading system is challenging because, by some estimates, only around 30% of the information in speech production is visible on the lips. Nevertheless, several studies have reported high accuracies on small-vocabulary tasks, while investigations of larger vocabularies remain much rarer. This work examines the construction of a large-vocabulary lipreading system using an approach based on Deep Neural Network Hidden Markov Models (DNN-HMMs). We tackle the problem of lipreading an unseen speaker and investigate the effect of several visual-feature pre-processing steps. Moreover, we examine the contribution of language modelling to a lipreading system, using longer n-grams to recognise visual speech. Our lipreading system is built on the 6000-word-vocabulary TCD-TIMIT audiovisual speech corpus. The results show that visual speech recognition can exceed 50% word accuracy on large vocabularies: we achieve a mean of 53.83%, measured via three-fold cross-validation on the speaker-independent setting of TCD-TIMIT using bigrams.
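The abstract does not spell out the language-model details, but as a minimal illustration of the n-gram language modelling it refers to, the sketch below trains a bigram model with add-one (Laplace) smoothing on a toy word corpus. The corpus, function names, and smoothing choice are illustrative assumptions, not the paper's actual setup (which would typically use a toolkit-trained n-gram model over the TCD-TIMIT vocabulary).

```python
from collections import Counter

def train_bigram_lm(sentences):
    # Count unigram contexts and bigram pairs over sentence-boundary-padded
    # token sequences. Unigram counts cover every token that serves as a
    # left context (so </s> is excluded, <s> is included).
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word, vocab_size):
    # Add-one smoothed conditional probability P(word | prev):
    # (count(prev, word) + 1) / (count(prev) + V).
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Toy corpus of two "recognised" word sequences (purely illustrative).
corpus = [["bin", "blue", "at"], ["bin", "green", "at"]]
uni, bi = train_bigram_lm(corpus)
# Vocabulary of 6 symbols: bin, blue, green, at, <s>, </s>.
p = bigram_prob(uni, bi, "bin", "blue", vocab_size=6)
```

In a decoder, such conditional probabilities would be combined (in log space) with the acoustic or visual model scores to rank competing word hypotheses; longer n-grams simply extend the conditioning context from one previous word to several.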
