Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR

In this paper, we present Hitachi and Paderborn University's joint effort on automatic speech recognition (ASR) in a dinner party scenario. The main challenges for ASR systems on dinner party recordings captured by multiple microphone arrays are (1) heavy speech overlap, (2) severe noise and reverberation, (3) very natural conversational content, and possibly (4) insufficient training data. As an example of a dinner party scenario, we chose the data presented during the CHiME-5 speech recognition challenge, where the baseline ASR system had a 73.3% word error rate (WER) and even the best-performing system at the CHiME-5 challenge had a 46.1% WER. We extensively investigated the combination of a guided source separation (GSS) based speech enhancement technique with a previously proposed strong ASR backend, and found that a tight integration of these techniques provided substantial accuracy improvements. Our final system achieved WERs of 39.94% and 41.64% on the development and evaluation data, respectively, both of which are the best published results for this dataset. We also investigated training with additional data beyond the officially provided small amount of training data in the CHiME-5 corpus to assess the intrinsic difficulty of this ASR task.
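
To make the enhancement front-end concrete, below is a minimal numpy sketch of the mask-driven MVDR beamforming step at the heart of a GSS-style pipeline, written in the Souden formulation. The function name `mask_based_mvdr`, the array layout, and the diagonal-loading constant are illustrative assumptions, not the authors' implementation; in the actual GSS front-end, the time-frequency masks are estimated by a complex angular central Gaussian mixture model whose posteriors are guided by the utterance time annotations, and WPE dereverberation is applied before beamforming.

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask, ref_channel=0, eps=1e-10):
    """Mask-driven MVDR beamformer (Souden formulation), applied per frequency bin.

    Y            : complex STFT of the multichannel mixture, shape (F, C, T)
    speech_mask  : time-frequency mask for the target speaker, shape (F, T)
    noise_mask   : time-frequency mask for noise + interference, shape (F, T)
    ref_channel  : reference microphone for the distortionless constraint
    Returns the enhanced single-channel STFT, shape (F, T).
    """
    F, C, T = Y.shape
    X_hat = np.empty((F, T), dtype=Y.dtype)
    for f in range(F):
        Yf = Y[f]  # (C, T)
        # Mask-weighted spatial covariance matrices of speech and noise.
        Phi_s = (speech_mask[f] * Yf) @ Yf.conj().T / max(speech_mask[f].sum(), eps)
        Phi_n = (noise_mask[f] * Yf) @ Yf.conj().T / max(noise_mask[f].sum(), eps)
        # Diagonal loading keeps the noise covariance invertible.
        Phi_n += eps * np.trace(Phi_n).real / C * np.eye(C)
        # Souden MVDR: w = Phi_n^{-1} Phi_s u / trace(Phi_n^{-1} Phi_s)
        G = np.linalg.solve(Phi_n, Phi_s)
        w = G[:, ref_channel] / max(np.trace(G).real, eps)
        # Beamformer output: x_hat(t) = w^H y(t)
        X_hat[f] = w.conj() @ Yf
    return X_hat
```

Given an (F, C, T) STFT `Y` and the two masks, `mask_based_mvdr(Y, speech_mask, noise_mask)` yields the enhanced signal referenced to one microphone, which would then be resynthesized and passed to the ASR backend.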
