MobiLipNet: Resource-Efficient Deep Learning Based Lipreading

Recent work in visual speech recognition exploits deep learning advances to improve accuracy. The focus, however, has been primarily on recognition performance, largely ignoring the computational burden of deep architectures. In this paper we address both goals jointly, aiming at high computational efficiency as well as high recognition accuracy in lipreading. For this purpose, we investigate the MobileNet convolutional neural network architectures, recently proposed for image classification. In addition, we extend the 2D convolutions of MobileNets to 3D ones, in order to better model the spatio-temporal nature of the lipreading problem. We investigate two architectures in this extension, introducing the temporal dimension as part of either the depthwise or the pointwise MobileNet convolutions. To further boost computational efficiency, we also consider using pointwise convolutions alone, as well as networks operating on half the mouth region. We evaluate the proposed architectures on speaker-independent visual-only continuous speech recognition on the popular TCD-TIMIT corpus. Our best system outperforms a baseline CNN by 4.27% absolute in word error rate and is more than 12 times more computationally efficient; compared to a state-of-the-art ResNet, it is 37 times more efficient at a minor 0.07% absolute word error rate degradation.
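The efficiency gains of the MobileNet-style factorization come from replacing a single dense 3D convolution with a depthwise step (one small filter per input channel) followed by a pointwise step (a 1x1x1 convolution that mixes channels). A minimal sketch of the multiply-accumulate (MAC) counts behind this trade-off is given below; the layer shapes are hypothetical examples, not the paper's actual network configuration.

```python
# Hypothetical layer shapes for illustration only; not the paper's configuration.

def conv3d_macs(t, h, w, c_in, c_out, kt, kh, kw):
    """MACs of a standard 3D convolution (stride 1, 'same' padding)."""
    return t * h * w * c_in * c_out * kt * kh * kw

def depthwise_separable_3d_macs(t, h, w, c_in, c_out, kt, kh, kw):
    """MACs of the factorized version: depthwise 3D conv + pointwise 1x1x1 conv."""
    depthwise = t * h * w * c_in * kt * kh * kw  # one kt x kh x kw filter per channel
    pointwise = t * h * w * c_in * c_out         # 1x1x1 conv mixing channels
    return depthwise + pointwise

# Example: a mid-network layer on a mouth-region feature map
# (25 frames, 22x22 spatial, 64 -> 128 channels, 3x3x3 kernels).
standard = conv3d_macs(t=25, h=22, w=22, c_in=64, c_out=128, kt=3, kh=3, kw=3)
separable = depthwise_separable_3d_macs(25, 22, 22, 64, 128, 3, 3, 3)
print(f"standard:  {standard:,} MACs")
print(f"separable: {separable:,} MACs")
print(f"reduction: {standard / separable:.1f}x")  # roughly 22x for these shapes
```

This is why the paper can trade a small amount of accuracy for a large reduction in computation: the factorized cost grows as `c_in * (k^3 + c_out)` rather than `c_in * c_out * k^3` per output position.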
