论文信息 - Deep Lip Reading: a comparison of models and an online application

Deep Lip Reading: a comparison of models and an online application

The goal of this paper is to develop state-of-the-art models for lip reading -- visual speech recognition. We develop three architectures and compare their accuracy and training times: (i) a recurrent model using LSTMs; (ii) a fully convolutional model; and (iii) the recently proposed transformer model. The recurrent and fully convolutional models are trained with a Connectionist Temporal Classification loss and use an explicit language model for decoding, the transformer is a sequence-to-sequence model. Our best performing model improves the state-of-the-art word error rate on the challenging BBC-Oxford Lip Reading Sentences 2 (LRS2) benchmark dataset by over 20 percent. As a further contribution we investigate the fully convolutional model when used for online (real time) lip reading of continuous speech, and show that it achieves high performance with low latency.

Joon Son Chung | Andrew Zisserman | Triantafyllos Afouras | Andrew Zisserman | Triantafyllos Afouras

[1] Yann Dauphin,et al. Convolutional Sequence to Sequence Learning , 2017, ICML.

[2] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[3] Yann Dauphin,et al. A Convolutional Encoder Model for Neural Machine Translation , 2016, ACL.

[4] Gabriel Synnaeve,et al. Wav2Letter: an End-to-End ConvNet-based Speech Recognition System , 2016, ArXiv.

[5] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[6] Alex Graves,et al. Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[7] Nitish Srivastava,et al. Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[8] Joon Son Chung,et al. Learning to lip read words by watching videos , 2018, Comput. Vis. Image Underst..

[9] Jürgen Schmidhuber,et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[10] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Tara N. Sainath,et al. An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[13] Colin Raffel,et al. Online and Linear-Time Attention by Enforcing Monotonic Alignments , 2017, ICML.

[14] Zhiheng Huang,et al. Residual Convolutional CTC Networks for Automatic Speech Recognition , 2017, ArXiv.

[15] Daniel Jurafsky,et al. Lexicon-Free Conversational Speech Recognition with Neural Networks , 2015, NAACL.

[16] Jinyu Li,et al. Improved training for online end-to-end speech recognition systems , 2017, INTERSPEECH.

[17] Samy Bengio,et al. An Online Sequence-to-Sequence Model Using Partial Conditioning , 2015, NIPS.

[18] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[19] Maja Pantic,et al. End-to-End Audiovisual Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20] Bhuvana Ramabhadran,et al. End-to-end speech recognition and keyword search on low-resource languages , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21] Wonyong Sung,et al. Online Sequence Training of Recurrent Neural Networks with Connectionist Temporal Classification , 2015, ArXiv.

[22] Gabriel Synnaeve,et al. Letter-Based Speech Recognition with Gated ConvNets , 2017, ArXiv.

[23] Heiga Zen,et al. WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[24] Kai Yu,et al. Phone Synchronous Speech Recognition With CTC Lattices , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[25] Lukasz Kaiser,et al. Depthwise Separable Convolutions for Neural Machine Translation , 2017, ICLR.

[26] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[27] Iasonas Kokkinos,et al. Learning Filterbanks from Raw Speech for Phone Recognition , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[28] Alex Graves,et al. Neural Machine Translation in Linear Time , 2016, ArXiv.

[29] Navdeep Jaitly,et al. Learning online alignments with continuous rewards policy gradient , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30] Yoshua Bengio,et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[31] Themos Stafylakis,et al. Combining Residual Networks with LSTMs for Lipreading , 2017, INTERSPEECH.

[32] Hermann Ney,et al. CTC in the Context of Generalized Full-Sum HMM Training , 2017, INTERSPEECH.

[33] Ying Zhang,et al. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks , 2016, INTERSPEECH.

[34] Vladlen Koltun,et al. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling , 2018, ArXiv.

[35] Navdeep Jaitly,et al. Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[36] George Kurian,et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.

[37] Joon Son Chung,et al. Lip Reading in the Wild , 2016, ACCV.

[38] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[39] Shimon Whiteson,et al. LipNet: Sentence-level Lipreading , 2016, ArXiv.

[40] Joon Son Chung,et al. Lip Reading in Profile , 2017, BMVC.

[41] Joon Son Chung,et al. Lip Reading Sentences in the Wild , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[43] Shimon Whiteson,et al. LipNet: End-to-End Sentence-level Lipreading , 2016, 1611.01599.