Multi-Temporal Lip-Audio Memory for Visual Speech Recognition

Visual Speech Recognition (VSR) is the task of predicting a sentence or word from lip movements. Several recent works use audio signals to supplement visual information; however, existing methods exploit only limited information, such as phoneme-level features or the soft labels of Automatic Speech Recognition (ASR) networks. In this paper, we present a Multi-Temporal Lip-Audio Memory (MTLAM) that makes the best use of audio signals to complement the insufficient information in lip movements. The proposed method consists of two main parts: 1) MTLAM stores multi-temporal audio features produced from short- and long-term audio signals, and it memorizes a visual-to-audio mapping so that the stored multi-temporal audio features can be retrieved from visual features alone at inference. 2) We design an audio temporal model that produces multi-temporal audio features capturing the context of neighboring words. In addition, to construct an effective visual-to-audio mapping, the audio temporal model generates audio features that are time-aligned with the visual features. Through extensive experiments, we validate the effectiveness of MTLAM, achieving state-of-the-art performance on two public VSR datasets.
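The two components described above can be made concrete with a short sketch. The following is a minimal PyTorch-style illustration, not the authors' implementation: parallel short- and long-kernel 1D convolutions stand in for the "short- and long-term" audio features while keeping the sequence time-aligned with the video frame rate, and a key-value memory addressed by visual features stores the multi-temporal audio features. All class names, slot counts, and dimensions (AudioTemporalEncoder, LipAudioMemory, num_slots=112, dim=512) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioTemporalEncoder(nn.Module):
    """Sketch of the multi-temporal idea: parallel 1D convolutions with
    small and large receptive fields yield short- and long-term audio
    features that remain time-aligned with frame-level visual features."""
    def __init__(self, dim=512):
        super().__init__()
        self.short = nn.Conv1d(dim, dim, kernel_size=3, padding=1)   # local context
        self.long = nn.Conv1d(dim, dim, kernel_size=15, padding=7)   # neighboring words

    def forward(self, aud_feat):
        # aud_feat: (B, T, dim), already resampled to the video frame rate.
        x = aud_feat.transpose(1, 2)                  # (B, dim, T) for Conv1d
        return (self.short(x).transpose(1, 2),        # (B, T, dim)
                self.long(x).transpose(1, 2))         # (B, T, dim)

class LipAudioMemory(nn.Module):
    """Key-value memory sketch: visual features address learnable key slots,
    and the values store short- and long-term audio representations."""
    def __init__(self, num_slots=112, dim=512):
        super().__init__()
        self.key_mem = nn.Parameter(torch.randn(num_slots, dim))
        self.val_mem_short = nn.Parameter(torch.randn(num_slots, dim))
        self.val_mem_long = nn.Parameter(torch.randn(num_slots, dim))

    def forward(self, vis_feat):
        # vis_feat: (B, T, dim) frame-level visual features.
        # Soft addressing via scaled dot-product similarity to the keys.
        attn = F.softmax(
            vis_feat @ self.key_mem.t() / self.key_mem.size(-1) ** 0.5, dim=-1)
        # Retrieve stored multi-temporal audio features from visual input only.
        return attn @ self.val_mem_short, attn @ self.val_mem_long  # (B, T, dim) each
```

In a setup like this, the retrieved features would be supervised during training by the true short- and long-term audio features (e.g., with a reconstruction or contrastive loss), so that at inference the memory can supply audio-like features from video alone.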
