Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
Y. Ro | Minsu Kim | Jeong Hun Yeo | J. Choi
[1] Dae Hoe Kim,et al. AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model , 2023, IEEE Transactions on Multimedia.
[2] Dahun Kim,et al. Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation , 2023, ArXiv.
[3] Y. Ro,et al. Intelligible Lip-to-Speech Synthesis with Speech Units , 2023, INTERSPEECH 2023.
[4] Brian Yan,et al. Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning , 2023, INTERSPEECH 2023.
[5] Y. Ro,et al. Multi-Temporal Lip-Audio Memory for Visual Speech Recognition , 2023, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[6] M. Pantic,et al. Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels , 2023, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[7] Y. Ro,et al. Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring , 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Y. Ro,et al. Lip-to-Speech Synthesis in the Wild with Multi-task Learning , 2023, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[9] Y. Ro,et al. Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition , 2023, ArXiv.
[10] Jinyu Li,et al. VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning , 2022, IEEE Transactions on Multimedia.
[11] Y. Ro,et al. Speaker-adaptive Lip Reading with User-dependent Padding , 2022, ECCV.
[12] Y. Ro,et al. Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition , 2022, INTERSPEECH.
[13] Yong Man Ro,et al. Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading , 2022, AAAI.
[14] Y. Ro,et al. Lip to Speech Synthesis with Visual Context Attentional GAN , 2022, NeurIPS.
[15] M. Pantic,et al. Visual speech recognition for multiple languages in the wild , 2022, Nature Machine Intelligence.
[16] Abdel-rahman Mohamed,et al. Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction , 2022, ICLR.
[17] Abdel-rahman Mohamed,et al. Robust Self-Supervised Audio-Visual Speech Recognition , 2022, INTERSPEECH.
[18] Fang Wen,et al. Vector Quantized Diffusion Model for Text-to-Image Synthesis , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Zi-Yi Dou,et al. An Empirical Study of Training End-to-End Vision-and-Language Transformers , 2021, Computer Vision and Pattern Recognition.
[20] Triantafyllos Afouras,et al. Sub-word Level Lip Reading With Visual Attention , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Yong Man Ro,et al. Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[22] Maja Pantic,et al. LiRA: Learning Visual Speech Representations from Audio through Self-supervision , 2021, Interspeech.
[23] Ruslan Salakhutdinov,et al. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units , 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[24] Guoqiang Han,et al. Learning from the Master: Distilling Cross-modal Advanced Knowledge for Lip Reading , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[25] Eugene Kharitonov,et al. Speech Resynthesis from Discrete Disentangled Self-Supervised Representations , 2021, Interspeech.
[26] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[27] Maja Pantic,et al. End-To-End Audio-Visual Speech Recognition with Conformers , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[28] Douglas W. Oard,et al. The Multilingual TEDx Corpus for Speech Recognition and Translation , 2021, Interspeech.
[29] Emmanuel Dupoux,et al. On Generative Spoken Language Modeling from Raw Audio , 2021, Transactions of the Association for Computational Linguistics.
[30] Emmanuel Dupoux,et al. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation , 2021, ACL.
[31] B. Ommer,et al. Taming Transformers for High-Resolution Image Synthesis , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[32] Shuang Yang,et al. Learn an Effective Lip Reading Model without Pains , 2020, ArXiv.
[33] Gabriel Synnaeve,et al. MLS: A Large-Scale Multilingual Dataset for Speech Research , 2020, INTERSPEECH.
[34] Zhou Zhao,et al. DeVLBert: Learning Deconfounded Visio-Linguistic Representations , 2020, ACM Multimedia.
[35] Maja Pantic,et al. Towards Practical Lipreading with Distilled and Efficient Models , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[36] Abdel-rahman Mohamed,et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations , 2020, NeurIPS.
[37] S. Shan,et al. Synchronous Bidirectional Learning for Multilingual Lip Reading , 2020, BMVC.
[38] Xilin Chen,et al. Mutual Information Maximization for Effective Lip Reading , 2020, 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020).
[39] Shuang Yang,et al. Deformation Flow Based Two-Stream Network for Lip Reading , 2020, 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020).
[40] Shuang Yang,et al. Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition , 2020, 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020).
[41] Yong Man Ro,et al. Lightweight and Effective Facial Landmark Detection using Adversarial Learning with Face Geometric Map Generative Network , 2020, IEEE Transactions on Circuits and Systems for Video Technology.
[42] Maja Pantic,et al. Lipreading Using Temporal Convolutional Networks , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[43] Joon Son Chung,et al. ASR is All You Need: Cross-Modal Distillation for Lip Reading , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[44] Haihong Tang,et al. Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers , 2019, AAAI.
[45] Colin Raffel,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res.
[46] Michael Auli,et al. vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations , 2019, ICLR.
[47] Shilin Wang,et al. Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[48] Yu Cheng,et al. UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.
[49] Jason J. Corso,et al. Unified Vision-Language Pre-Training for Image Captioning and VQA , 2019, AAAI.
[50] Mingli Song,et al. A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading , 2019, MMAsia.
[51] Cho-Jui Hsieh,et al. VisualBERT: A Simple and Performant Baseline for Vision and Language , 2019, ArXiv.
[52] Yiming Yang,et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding , 2019, NeurIPS.
[53] Kris Kitani,et al. Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading , 2019, BMVC.
[54] Cordelia Schmid,et al. VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[55] Myle Ott,et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling , 2019, NAACL.
[56] Jian Yang,et al. DSFD: Dual Shot Face Detector , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[57] Shiguang Shan,et al. LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild , 2018, 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019).
[58] Joon Son Chung,et al. Deep Audio-Visual Speech Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[59] Joon Son Chung,et al. LRS3-TED: a large-scale dataset for visual speech recognition , 2018, ArXiv.
[60] Taku Kudo,et al. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing , 2018, EMNLP.
[61] Joon Son Chung,et al. VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.
[62] Maja Pantic,et al. End-to-End Audiovisual Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[63] Oriol Vinyals,et al. Neural Discrete Representation Learning , 2017, NIPS.
[64] John R. Hershey,et al. Hybrid CTC/Attention Architecture for End-to-End Speech Recognition , 2017, IEEE Journal of Selected Topics in Signal Processing.
[65] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[66] Themos Stafylakis,et al. Combining Residual Networks with LSTMs for Lipreading , 2017, INTERSPEECH.
[67] Joon Son Chung,et al. Lip Reading in the Wild , 2016, ACCV.
[68] Joon Son Chung,et al. Out of Time: Automated Lip Sync in the Wild , 2016, ACCV Workshops.
[69] Joon Son Chung,et al. Lip Reading Sentences in the Wild , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[70] Shimon Whiteson,et al. LipNet: End-to-End Sentence-level Lipreading , 2016, ArXiv.
[71] Stephen J. Cox,et al. Improved speaker independent lip reading using speaker adaptive training and deep neural networks , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[72] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[73] Geoffrey E. Hinton,et al. Distilling the Knowledge in a Neural Network , 2015, ArXiv.
[74] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.
[75] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[76] Yoshua Bengio,et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.
[77] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.
[78] Ngoc Thang Vu,et al. Multilingual deep neural network based acoustic modeling for rapid language adaptation , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[79] Tanja Schultz,et al. Language-independent and language-adaptive acoustic modeling for speech recognition , 2001, Speech Commun.
[80] S. Hochreiter,et al. Long Short-Term Memory , 1997, Neural Computation.
[81] Naveen C. Kumar,et al. WORKSHOPS , 1993, 2022 Moratuwa Engineering Research Conference (MERCon).
[82] David Taylor. Hearing by Eye: The Psychology of Lip-Reading , 1988.
[83] Yong Man Ro,et al. CroMM-VSR: Cross-Modal Memory Augmented Visual Speech Recognition , 2022, IEEE Transactions on Multimedia.
[84] Yong Man Ro,et al. Speech Reconstruction With Reminiscent Sound Via Visual Voice Memory , 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[85] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[86] Alex Graves,et al. Connectionist Temporal Classification , 2012.