A CHiME-3 challenge system: Long-term acoustic features for noise robust automatic speech recognition

This paper describes an automatic speech recognition (ASR) system for the 3rd CHiME challenge, which addresses noisy acoustic scenes in public environments. The proposed system includes a multi-channel speech enhancement front-end with a microphone channel failure detection method that identifies erroneous microphone recordings by cross-comparing the modulation spectra of speech across channels. The main focus of the submission is the investigation of the amplitude modulation filter bank (AMFB) as a method for extracting long-term acoustic cues prior to a Gaussian mixture model (GMM) or deep neural network (DNN) based ASR classifier. AMFB features are shown to outperform the commonly used frame-splicing technique for filter bank features, even in a performance-optimized ASR challenge system. In other words, temporal analysis of speech by hand-crafted, auditory-motivated AMFBs proves more robust than a data-driven approach that extracts temporal dynamics with a DNN. Our final ASR system, which additionally adapts the acoustic features to speaker characteristics, achieves an absolute word error rate reduction of approximately 21.53 percentage points compared to the best CHiME-3 baseline system on the “real” test condition.
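
The abstract does not spell out the cross-comparison procedure for channel failure detection, so the following is a minimal numpy sketch of the idea: compute a normalized modulation spectrum per microphone channel from the frame-wise amplitude envelope, correlate channels pairwise, and flag any channel that disagrees with the array consensus. The 10 ms envelope frame length, 30 Hz modulation range, and `corr_threshold` value are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np

def modulation_spectrum(x, fs, frame_len=0.01, max_mod_hz=30.0):
    """Normalized power spectrum of the frame-wise RMS envelope (assumed design)."""
    hop = int(fs * frame_len)                      # envelope sampled at 100 Hz here
    n_frames = len(x) // hop
    frames = x[:n_frames * hop].reshape(n_frames, hop)
    env = np.sqrt(np.mean(frames ** 2, axis=1))    # amplitude envelope per frame
    env = env - env.mean()                         # remove DC before the FFT
    spec = np.abs(np.fft.rfft(env * np.hanning(n_frames))) ** 2
    freqs = np.fft.rfftfreq(n_frames, d=frame_len)
    spec = spec[freqs <= max_mod_hz]               # keep speech-relevant modulations
    return spec / (spec.sum() + 1e-12)             # unit-area normalization

def detect_failed_channels(channels, fs, corr_threshold=0.6):
    """Flag channels whose modulation spectrum disagrees with the other channels."""
    specs = np.stack([modulation_spectrum(x, fs) for x in channels])
    flags = np.zeros(len(specs), dtype=bool)
    for i in range(len(specs)):
        corrs = [np.corrcoef(specs[i], specs[j])[0, 1]
                 for j in range(len(specs)) if j != i]
        flags[i] = np.mean(corrs) < corr_threshold  # outlier -> suspected failure
    return flags
```

For a CHiME-3 tablet recording, `channels` would be the six time-aligned microphone signals at fs = 16 kHz; a channel masked or broken during recording has a speech envelope that decorrelates from the others, which is what the pairwise comparison exploits.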

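To make the AMFB idea concrete, here is a minimal sketch of amplitude modulation filtering applied to log-mel features: each mel band's temporal trajectory is filtered by a bank of Hann-windowed complex exponential FIR filters, and the real and imaginary outputs are stacked as features. The 100 Hz frame rate, the `n_periods` window length rule, and the example center frequencies below are illustrative assumptions; the paper's actual filter shapes, bandwidths, and normalization may differ.

```python
import numpy as np

def amfb_filters(center_freqs_hz, frame_rate_hz=100.0, n_periods=1.5):
    """One Hann-windowed complex exponential FIR filter per modulation frequency."""
    filters = []
    for f in center_freqs_hz:
        n_taps = int(round(n_periods * frame_rate_hz / f)) | 1   # force odd length
        t = (np.arange(n_taps) - n_taps // 2) / frame_rate_hz    # seconds, centered
        filters.append(np.hanning(n_taps) * np.exp(2j * np.pi * f * t))
    return filters

def amfb_features(log_mel, filters):
    """Band-pass filter each mel band's trajectory; stack real and imaginary parts.

    log_mel: (n_frames, n_bands) array of log mel filter bank energies.
    Returns: (n_frames, 2 * len(filters) * n_bands) feature matrix.
    """
    outs = []
    for h in filters:
        # convolve along the time axis, keeping the original frame count
        y = np.stack([np.convolve(log_mel[:, b], h, mode="same")
                      for b in range(log_mel.shape[1])], axis=1)
        outs.extend([y.real, y.imag])
    return np.concatenate(outs, axis=1)
```

With hypothetical center frequencies such as `amfb_filters([5.0, 10.0, 16.7, 27.8])`, a 5 Hz filter spans roughly 300 ms of context, which illustrates why such features capture longer temporal dynamics than the typical +/-5-frame splicing of filter bank features.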