LipFormer: Learning to Lipread Unseen Speakers Based on Visual-Landmark Transformers

Lipreading refers to understanding a speaker's speech from video and translating it into natural-language text. State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers who appear in both the training and inference sets. However, generalizing these methods to unseen speakers incurs catastrophic performance degradation, owing to the limited number of speakers in the training set and the pronounced visual variations in lip shape and color across speakers. Relying solely on the visible changes of the lips therefore tends to cause the model to overfit. To address this problem, we propose to use multi-modal features across visuals and landmarks, which describe lip motion irrespective of speaker identity. On this basis, we develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer. Specifically, LipFormer consists of a lip motion stream, a facial landmark stream, and a cross-modal fusion module. The embeddings of the two streams are produced by self-attention and fed to a cross-attention module that aligns the visual and landmark features. Finally, the fused features are decoded into text by a cascade seq2seq model. Experiments demonstrate that our method effectively enhances model generalization to unseen speakers.
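
To make the described architecture concrete, the following is a minimal PyTorch sketch of the two-stream design: per-stream self-attention encoders, a cross-attention fusion of visual and landmark embeddings, and a decoder over the fused features. All feature widths, layer counts, input shapes, and the single-GRU stand-in for the cascade seq2seq decoder are illustrative assumptions; the abstract does not specify these details.

import torch
import torch.nn as nn

class LipFormerSketch(nn.Module):
    """Illustrative sketch of the two-stream, cross-attention fusion design.

    Feature widths, layer counts, and the decoder are assumptions made for
    this example, not the paper's actual configuration.
    """

    def __init__(self, visual_dim=512, landmark_dim=136, d_model=256,
                 n_heads=4, vocab_size=1000):
        super().__init__()
        # Project per-frame features of each modality to a shared width.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.landmark_proj = nn.Linear(landmark_dim, d_model)
        # One self-attention encoder per stream, as described in the abstract.
        self.visual_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2)
        self.landmark_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2)
        # Cross-attention aligns the two modalities: visual embeddings act
        # as queries over the landmark embeddings (keys/values).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        # Stand-in for the cascade seq2seq decoder: a single GRU plus a
        # linear classifier over the output vocabulary.
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, landmark_feats):
        # visual_feats:   (batch, frames, visual_dim), e.g. CNN lip-ROI features
        # landmark_feats: (batch, frames, landmark_dim), e.g. 68 (x, y) points
        v = self.visual_enc(self.visual_proj(visual_feats))
        g = self.landmark_enc(self.landmark_proj(landmark_feats))
        fused, _ = self.cross_attn(query=v, key=g, value=g)
        out, _ = self.decoder(fused)
        return self.classifier(out)  # (batch, frames, vocab_size) logits

Under these assumptions, model(torch.randn(2, 75, 512), torch.randn(2, 75, 136)) yields per-frame logits of shape (2, 75, 1000); the actual system would replace the per-frame classifier with an attention-based seq2seq decoder that emits text tokens.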
