Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation
Jiyoung Lee | Soo-Whan Chung | Sunok Kim | Hong-Goo Kang | Kwanghoon Sohn