An analysis of environment, microphone and data simulation mismatches in robust speech recognition

An analysis of the impact of acoustic mismatches between training and test data on the performance of robust ASR.

Including: environment, microphone, and data simulation mismatches.
Based on: a critical analysis of the results published on the CHiME-3 dataset and new experiments.
Result: with the exception of MVDR beamforming, these mismatches have little effect on ASR performance.
Contribution: the CHiME-4 challenge, which revisits the CHiME-3 dataset and reduces the number of microphones available for testing.

Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings, where the acoustic conditions of the training data match (or cover) those of the test data. Few studies have systematically assessed the impact of acoustic mismatches between training and test data, especially with respect to recent speech enhancement and state-of-the-art ASR techniques. In this article, we study this issue in the context of the CHiME-3 dataset, which consists of sentences spoken by talkers in challenging noisy environments, recorded with a 6-channel tablet-based microphone array. We provide a critical analysis of the results published on this dataset for various signal enhancement, feature extraction, and ASR backend techniques, and we perform a number of new experiments to separately assess the impact of different noise environments, different numbers and positions of microphones, and simulated vs. real data on speech enhancement and ASR performance. We show that, with the exception of minimum variance distortionless response (MVDR) beamforming, most algorithms perform consistently on real and simulated data and can benefit from training on simulated data. We also find that training on different noise environments and different microphones barely affects ASR performance, especially when several environments are present in the training data: only the number of microphones has a significant impact. Based on these results, we introduce the CHiME-4 Speech Separation and Recognition Challenge, which revisits the CHiME-3 dataset and makes it more challenging by reducing the number of microphones available for testing.
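For context on the one technique singled out above: the MVDR beamformer computes, at each frequency f, the filter w(f) = Φ_n(f)^{-1} d(f) / (d(f)^H Φ_n(f)^{-1} d(f)), where d(f) is the steering vector toward the target and Φ_n(f) the noise spatial covariance; its dependence on accurate estimates of these quantities is a plausible reason for the real-vs.-simulated inconsistency reported in the abstract. Below is a minimal numpy sketch of the standard formula, not the authors' implementation; the function and variable names are ours.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR filter w = R^{-1} d / (d^H R^{-1} d) for one frequency bin.

    noise_cov: (M, M) complex Hermitian noise spatial covariance matrix
    steering:  (M,)   complex steering vector toward the target speaker
    """
    r_inv_d = np.linalg.solve(noise_cov, steering)  # R^{-1} d
    return r_inv_d / (steering.conj() @ r_inv_d)    # normalize by d^H R^{-1} d

# Toy usage with random data for M = 6 microphones (the CHiME-3 array size):
M = 6
rng = np.random.default_rng(0)
d = rng.standard_normal(M) + 1j * rng.standard_normal(M)
N = rng.standard_normal((M, 100)) + 1j * rng.standard_normal((M, 100))
R = (N @ N.conj().T) / N.shape[1]      # sample noise covariance estimate
w = mvdr_weights(R, d)
assert np.isclose(w.conj() @ d, 1.0)   # distortionless constraint: w^H d = 1
```

In practice the filter is applied per frequency bin to the multichannel STFT, Y(f, t) = w(f)^H X(f, t); errors in d(f), e.g. from time delays estimated on simulated impulse responses, propagate directly into the output.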
