Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

This paper proposes a novel lip reading framework for low-resource languages, a setting that has received little attention in the literature. Because low-resource languages lack the video-text paired data needed to train a model that can adequately capture both lip movements and language, developing lip reading models for them is challenging. To mitigate this challenge, we first learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units. Since different languages partially share common phonemes, general speech knowledge learned from one language can be extended to others. We then learn language-specific knowledge, the ability to model language, with the proposed Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder stores language-specific audio features in memory banks and can be trained on audio-text paired data, which is far more accessible than video-text paired data. With LMDecoder, we can therefore transform input speech units into language-specific audio features and translate them into text using the learned rich language knowledge. Finally, by combining general speech knowledge and language-specific knowledge, we can efficiently build lip reading models even for low-resource languages. Extensive experiments on five languages (English, Spanish, French, Italian, and Portuguese) demonstrate the effectiveness of the proposed method.
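The core retrieval step of a memory-augmented decoder can be illustrated with a minimal sketch: a query feature derived from the predicted speech units soft-attends over a memory bank of stored language-specific audio features. This is an illustrative toy example, not the paper's implementation; all names (`retrieve_from_memory`, `memory_keys`, `memory_values`) and the scaled dot-product addressing scheme are assumptions for exposition.

```python
import numpy as np

def retrieve_from_memory(query, memory_keys, memory_values):
    """Soft-attention lookup over a hypothetical language-specific memory bank.

    query:         (d,)   feature derived from predicted speech units
    memory_keys:   (M, d) addressing keys, one per memory slot
    memory_values: (M, d) stored language-specific audio features
    Returns a (d,) feature: a convex combination of the memory values.
    """
    d = query.shape[0]
    scores = memory_keys @ query / np.sqrt(d)   # scaled dot-product addressing
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return weights @ memory_values              # weighted sum of stored features

# Toy usage with random features.
rng = np.random.default_rng(0)
d, M = 16, 8
query = rng.normal(size=d)
keys = rng.normal(size=(M, d))
values = rng.normal(size=(M, d))
retrieved = retrieve_from_memory(query, keys, values)
print(retrieved.shape)  # (16,)
```

Because the softmax weights are non-negative and sum to one, each retrieved coordinate stays within the range spanned by the stored values, so the decoder always works with features drawn from the learned audio space rather than arbitrary vectors.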