The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition
Chin-Hui Lee, Shinji Watanabe, Jia Pan, Jun Du, O. Scharenborg, Jianqing Gao, Jingdong Chen, Diyuan Liu, Hang Chen, Maokui He, S. Siniscalchi, Zhe Wang, Cong Liu, Baocai Yin, Shilong Wu
[1] Chin-Hui Lee, et al. Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis, 2022, INTERSPEECH.
[2] Chin-Hui Lee, et al. End-to-End Audio-Visual Neural Speaker Diarization, 2022, INTERSPEECH.
[3] Y. Qian, et al. The SJTU System for Multimodal Information Based Speech Processing Challenge 2021, 2022, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[4] Wei Li, et al. Channel-Wise AV-Fusion Attention for Multi-Channel Audio-Visual Speech Recognition, 2022, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[5] Sabato Marco Siniscalchi, et al. The First Multimodal Information Based Speech Processing (MISP) Challenge: Data, Tasks, Baselines and Results, 2022, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[6] Chin-Hui Lee, et al. The USTC-Ximalaya System for the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) Challenge, 2022, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[7] Joon Son Chung, et al. Deep Audio-Visual Speech Recognition, 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[8] Maja Pantic, et al. End-to-End Audio-Visual Speech Recognition with Conformers, 2021, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[9] Andreas Stolcke, et al. DOVER-Lap: A Method for Combining Overlap-Aware Diarization Outputs, 2021, IEEE Spoken Language Technology Workshop (SLT).
[10] Lukáš Burget, et al. Bayesian HMM Clustering of x-vector Sequences (VBx) in Speaker Diarization: Theory, Implementation and Analysis on Standard Tasks, 2020, Computer Speech & Language.
[11] Kenneth Ward Church, et al. Third DIHARD Challenge Evaluation Plan, 2020, arXiv.
[12] Yu Zhang, et al. Conformer: Convolution-augmented Transformer for Speech Recognition, 2020, INTERSPEECH.
[13] Yuri Y. Khokhlov, et al. The STC System for the CHiME-6 Challenge, 2020, 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020).
[14] Desh Raj, et al. The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge, 2020, 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020).
[15] Jon Barker, et al. CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings, 2020, 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020).
[16] Yong Xu, et al. Self-Supervised Learning for Audio-Visual Speaker Diarization, 2020, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[17] Maja Pantic, et al. Lipreading Using Temporal Convolutional Networks, 2020, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[18] Dong Wang, et al. CN-Celeb: A Challenging Chinese Speaker Recognition Dataset, 2020, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[19] Syed Zubair, et al. Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model, 2019, Sensors.
[20] Yazan Abu Farha, et al. MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation, 2019, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Carlos Busso, et al. Gating Neural Network for Large Vocabulary Audiovisual Speech Recognition, 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[22] Joon Son Chung, et al. VoxCeleb2: Deep Speaker Recognition, 2018, INTERSPEECH.
[23] Sanjeev Khudanpur, et al. X-Vectors: Robust DNN Embeddings for Speaker Recognition, 2018, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[24] Radu Horaud, et al. Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion, 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[25] Reinhold Haeb-Umbach, et al. NARA-WPE: A Python Package for Weighted Prediction Error Dereverberation in Numpy and Tensorflow for Online and Offline Processing, 2018, ITG Symposium on Speech Communication.
[26] Joon Son Chung, et al. VoxCeleb: A Large-Scale Speaker Identification Dataset, 2017, INTERSPEECH.
[27] Joon Son Chung, et al. Lip Reading Sentences in the Wild, 2017, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Gwenn Englebienne, et al. Multimodal Speaker Diarization, 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[29] Daniel Povey, et al. The Kaldi Speech Recognition Toolkit, 2011.
[30] Lawrence D. Rosenblum, et al. Speech Perception as a Multimodal Phenomenon, 2008, Current Directions in Psychological Science.
[31] Xavier Anguera Miró, et al. Acoustic Beamforming for Speaker Diarization of Meetings, 2007, IEEE Transactions on Audio, Speech, and Language Processing.
[32] John S. Garofolo, et al. The Rich Transcription 2006 Spring Meeting Recognition Evaluation, 2006, Machine Learning for Multimodal Interaction.
[33] Hani Yehia, et al. Quantitative association of vocal-tract and facial behavior, 1998, Speech Communication.