Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading

Current state-of-the-art approaches for lip reading are based on sequence-to-sequence architectures that are designed for natural machine translation and audio speech recognition. Hence, these methods do not fully exploit the characteristics of the lip dynamics, causing two main drawbacks. First, the short-range temporal dependencies, which are critical to the mapping from lip images to visemes, receives no extra attention. Second, local spatial information is discarded in the existing sequence models due to the use of global average pooling (GAP). To well solve these drawbacks, we propose a Temporal Focal block to sufficiently describe short-range dependencies and a Spatio-Temporal Fusion Module (STFM) to maintain the local spatial information and to reduce the feature dimensions as well. From the experiment results, it is demonstrated that our method achieves comparable performance with the state-of-the-art approach using much less training data and much lighter Convolutional Feature Extractor. The training time is reduced by 12 days due to the convolutional structure and the local self-attention mechanism.

[1]  Wei Shi,et al.  Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification , 2016, ACL.

[2]  Shimon Whiteson,et al.  LipNet: Sentence-level Lipreading , 2016, ArXiv.

[3]  Qiang Chen,et al.  Network In Network , 2013, ICLR.

[4]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Li Wang,et al.  A Reinforced Topic-Aware Convolutional Sequence-to-Sequence Model for Abstractive Text Summarization , 2018, IJCAI.

[6]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[7]  Mirella Lapata,et al.  Long Short-Term Memory-Networks for Machine Reading , 2016, EMNLP.

[8]  Themos Stafylakis,et al.  Combining Residual Networks with LSTMs for Lipreading , 2017, INTERSPEECH.

[9]  Christopher D. Manning,et al.  Effective Approaches to Attention-based Neural Machine Translation , 2015, EMNLP.

[10]  Demetri Terzopoulos,et al.  Snakes: Active contour models , 2004, International Journal of Computer Vision.

[11]  L. Baum,et al.  Statistical Inference for Probabilistic Functions of Finite State Markov Chains , 1966 .

[12]  Hermann Ney,et al.  Deep Learning of Mouth Shapes for Sign Language , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[13]  Jon Barker,et al.  An audio-visual corpus for speech perception and automatic speech recognition. , 2006, The Journal of the Acoustical Society of America.

[14]  Yann Dauphin,et al.  Convolutional Sequence to Sequence Learning , 2017, ICML.

[15]  Hanqing Lu,et al.  Reading Scene Text with Attention Convolutional Sequence Modeling , 2017, ArXiv.

[16]  Pasquale Pagano,et al.  OpenDLib: A Digital Library Service System , 2002, ECDL.

[17]  Joon Son Chung,et al.  LRS3-TED: a large-scale dataset for visual speech recognition , 2018, ArXiv.

[18]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[19]  Ali Aghagolzadeh,et al.  Feature extraction using discrete cosine transform and discrimination power analysis with a face recognition technology , 2010, Pattern Recognit..

[20]  Quoc V. Le,et al.  Listen, Attend and Spell , 2015, ArXiv.

[21]  Joon Son Chung,et al.  Lip Reading Sentences in the Wild , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Katsushi Ikeuchi,et al.  A Spherical Representation for Recognition of Free-Form Surfaces , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[23]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[24]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[26]  M. Studdert-Kennedy,et al.  Hemispheric specialization for speech perception. , 1970, The Journal of the Acoustical Society of America.

[27]  Wei Liu,et al.  ParseNet: Looking Wider to See Better , 2015, ArXiv.

[28]  Joon Son Chung,et al.  Out of Time: Automated Lip Sync in the Wild , 2016, ACCV Workshops.

[29]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[30]  Vaibhava Goel,et al.  Deep multimodal learning for Audio-Visual Speech Recognition , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31]  Timothy F. Cootes,et al.  Active Shape Models-Their Training and Application , 1995, Comput. Vis. Image Underst..

[32]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[33]  Zhen Zhang,et al.  Convolutional Sequence to Sequence Model for Human Dynamics , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Joon Son Chung,et al.  Lip Reading in the Wild , 2016, ACCV.

[35]  Stuart J. Russell,et al.  Dynamic bayesian networks: representation, inference and learning , 2002 .

[36]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[37]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[38]  Feng Cheng,et al.  Visual speaker authentication with random prompt texts by a dual-task CNN framework , 2018, Pattern Recognit..

[39]  Peter L. Bartlett,et al.  Boosting Algorithms as Gradient Descent , 1999, NIPS.

[40]  Joon Son Chung,et al.  Deep Audio-Visual Speech Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.