CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

Following the success of the 1st, 2nd, 3rd, 4th, and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. The speech material is the same as the previous CHiME-5 recordings, except that the microphone arrays are now accurately synchronized. The material was elicited using a dinner party scenario, with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible, open-source baselines providing speech enhancement, speaker diarization, and speech recognition modules.
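
The Track 2 baseline chains enhancement, diarization, and recognition over unsegmented recordings. The following minimal Python sketch illustrates that data flow only; the function names, stub outputs, and file name are hypothetical placeholders and do not reproduce the actual CHiME-6 baseline (which is distributed as Kaldi-based recipes).

# Hypothetical sketch of the Track 2 data flow: enhancement -> diarization -> ASR.
# All functions below are illustrative stubs, not the real baseline components.

from typing import List, Tuple

def enhance(multichannel_wav: str) -> str:
    # Placeholder for array-based enhancement (e.g. dereverberation + beamforming).
    # Stub: returns the input path unchanged.
    return multichannel_wav

def diarize(enhanced_wav: str) -> List[Tuple[float, float, str]]:
    # Placeholder for diarization (e.g. speaker-embedding extraction + clustering).
    # Stub: returns fixed (start, end, speaker_label) segments.
    return [(0.0, 5.0, "spk1"), (5.0, 9.5, "spk2")]

def recognize(enhanced_wav: str, segment: Tuple[float, float, str]) -> str:
    # Placeholder for per-segment ASR decoding. Stub: returns a dummy hypothesis.
    return "<hypothesis>"

def track2_pipeline(multichannel_wav: str) -> List[Tuple[str, str]]:
    # Unsegmented recognition: enhance once, diarize, then decode each segment.
    enhanced = enhance(multichannel_wav)
    segments = diarize(enhanced)
    return [(spk, recognize(enhanced, (start, end, spk)))
            for start, end, spk in segments]

if __name__ == "__main__":
    for speaker, hypothesis in track2_pipeline("S02_U06.wav"):  # hypothetical file name
        print(speaker, hypothesis)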
