Resource-Adaptive Deep Learning for Visual Speech Recognition

We focus on the problem of efficient architectures for lipreading that allow trading off computational resources against visual speech recognition accuracy. In particular, we make two contributions: First, we introduce MobiLipNetV3, an efficient and accurate lipreading model that builds on our earlier work on MobiLipNetV2 and incorporates recent advances in convolutional neural network architectures. Second, we propose a novel recognition paradigm, called MultiRate Ensemble (MRE), that combines a “lean” and a “full” MobiLipNetV3 in the lipreading pipeline, with the latter applied at a lower frame rate. This architecture yields a family of systems offering multiple accuracy vs. efficiency operating points depending on the frame-rate decimation of the “full” model, thus allowing adaptation to the available device resources. We evaluate our approach on the TCD-TIMIT corpus, a popular benchmark for speaker-independent lipreading of continuous speech. The proposed MRE family of systems can be up to 73 times more efficient than residual neural network based lipreading, and up to twice as efficient as MobiLipNetV2, while in both cases achieving up to 8% absolute word error rate (WER) reduction, depending on the chosen MRE operating point. For example, a temporal decimation factor of three yields a 7% absolute WER reduction and a 26% relative decrease in computation over MobiLipNetV2.
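To make the MRE idea concrete, below is a minimal PyTorch sketch of how a lean per-frame stream and a temporally decimated full stream could be combined. The feature-extractor interfaces, the repeat-based temporal upsampling, the concatenation fusion, and the dummy networks and input sizes in the demo are illustrative assumptions for this sketch, not the paper's exact design.

```python
import torch
import torch.nn as nn


class MultiRateEnsemble(nn.Module):
    """Illustrative MultiRate Ensemble (MRE): a lean model runs on every
    frame, while a full model runs only on every k-th frame (temporal
    decimation factor k). The full model's features are repeated to
    restore the original frame rate and fused with the lean stream.
    Fusion by concatenation is an assumption made for this sketch."""

    def __init__(self, lean: nn.Module, full: nn.Module, decimation: int = 3):
        super().__init__()
        self.lean = lean   # stands in for a "lean" MobiLipNetV3 variant
        self.full = full   # stands in for the "full" MobiLipNetV3
        self.k = decimation

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width); both extractors
        # are assumed to map this to per-frame features (batch, time, dim).
        T = frames.shape[1]
        lean_feats = self.lean(frames)               # (B, T, D_lean)
        full_feats = self.full(frames[:, ::self.k])  # (B, ceil(T/k), D_full)
        # Repeat each decimated feature k times, then trim back to T frames.
        full_up = full_feats.repeat_interleave(self.k, dim=1)[:, :T]
        return torch.cat([lean_feats, full_up], dim=-1)


if __name__ == "__main__":
    # Dummy extractors standing in for the two MobiLipNetV3 variants
    # (hypothetical stand-ins; an 88x88 mouth ROI is assumed here).
    class DummyNet(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.proj = nn.Linear(88 * 88, dim)

        def forward(self, x):  # x: (B, T, 1, 88, 88)
            B, T = x.shape[:2]
            return self.proj(x.reshape(B, T, -1))

    mre = MultiRateEnsemble(lean=DummyNet(64), full=DummyNet(256), decimation=3)
    video = torch.randn(2, 30, 1, 88, 88)  # 2 clips, 30 frames each
    print(mre(video).shape)                # torch.Size([2, 30, 320])
```

Under these assumptions, increasing the decimation factor k reduces how often the costly full model runs, sweeping out the accuracy vs. efficiency operating points described above.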
