Lip-to-Speech Synthesis in the Wild with Multi-task Learning