The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

The Multimodal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology to specific scenarios by promoting research into wake-up word detection, speaker diarization, speech recognition, and related technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), which aims to solve ``who spoke when'' using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task, which addresses ``who spoke what when'' using the audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video recorded in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the good performance of the AVDR baseline system, as well as the potential difficulties in this challenge arising from, e.g., the far-field video quality, the presence of TV noise in the background, and indistinguishable speakers.
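As a minimal illustration of how a ``who spoke when'' task such as AVSD is typically scored, the sketch below computes a frame-level diarization error rate (DER). It is a simplification, not the challenge's official scorer: the segment format, the 10 ms frame step, and the assumption that hypothesis speaker labels are already mapped to reference labels (no optimal permutation search, no forgiveness collar) are all assumptions for illustration.

```python
# Hypothetical frame-level DER sketch. Segments are (speaker, start_sec, end_sec)
# tuples; speaker labels are assumed pre-mapped between reference and hypothesis.

FRAME = 0.01  # 10 ms scoring frames (an assumption, not the official setting)

def to_frames(segments, total_sec):
    """Convert segments into a per-frame set of active speakers."""
    n = int(round(total_sec / FRAME))
    frames = [set() for _ in range(n)]
    for spk, start, end in segments:
        a = int(round(start / FRAME))
        b = min(int(round(end / FRAME)), n)
        for i in range(a, b):
            frames[i].add(spk)
    return frames

def der(ref_segs, hyp_segs, total_sec):
    """DER = (missed speech + false alarm + speaker confusion) / scored speech,
    accumulated per frame as max(|ref|, |hyp|) - |ref & hyp|."""
    ref = to_frames(ref_segs, total_sec)
    hyp = to_frames(hyp_segs, total_sec)
    speech = sum(len(r) for r in ref)  # total scored speaker-frames
    errors = sum(max(len(r), len(h)) - len(r & h) for r, h in zip(ref, hyp))
    return errors / max(speech, 1)

ref = [("A", 0.0, 1.0), ("B", 1.0, 2.0)]
hyp = [("A", 0.0, 1.0), ("B", 1.0, 1.5)]  # last 0.5 s of B is missed
print(der(ref, hyp, 2.0))  # 0.25: 50 missed frames over 200 speech frames
```

The per-frame error term `max(|ref|, |hyp|) - |ref & hyp|` counts missed speakers, falsely inserted speakers, and confusions once each, mirroring how standard DER tooling accumulates error time over overlapping speech.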
