DISCRIMINATIVE METHODS FOR NOISE ROBUST SPEECH RECOGNITION: A CHIME CHALLENGE BENCHMARK

The recently introduced second CHiME challenge is a difficult two-microphone speech recognition task with non-stationary interference. Current approaches in the source-separation community have focused on the front-end problem of estimating the clean signal from the noisy observations. Here we pursue a different approach, focusing on state-of-the-art ASR techniques such as discriminative training and various feature transformations, combined with simple noise suppression based on prior-based binary masking with an estimated angle of arrival. In addition, we propose an augmented discriminative feature transformation that can introduce arbitrary features into a discriminative feature transform, and an efficient method for combining discriminative language modeling (DLM) and minimum Bayes risk (MBR) decoding in an ASR post-processing stage; we also preliminarily investigate the effectiveness of deep neural networks for reverberant and noisy speech recognition. Using these techniques, we present a benchmark on the middle-vocabulary subtask of the CHiME challenge, demonstrating their effectiveness for this task. Promising results were also obtained for the proposed augmented feature transformation and for the combination of DLM and MBR decoding. Part of the training code has been released as an advanced ASR baseline built on the Kaldi speech recognition toolkit.

Index Terms—CHiME challenge, noise robust ASR, discriminative methods, feature transformation, prior-based binary masking
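To make the front-end concrete, the sketch below illustrates the general idea behind prior-based binary masking with an estimated angle of arrival in a two-microphone setup: for each time-frequency bin, the inter-channel phase difference is compared against the phase difference expected for a source at the prior target direction, and bins that disagree are zeroed out. This is a minimal illustration of the generic technique, not the paper's exact implementation; all function names, the microphone spacing, and the tolerance threshold are assumptions for the example.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Simple STFT: Hann-windowed frames followed by a real FFT."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=1)

def binary_mask_by_doa(left, right, n_fft=512, mic_dist=0.2, fs=16000,
                       c=343.0, target_angle=0.0, tol=0.35):
    """Prior-based binary mask (illustrative sketch, not the paper's code).

    Keeps time-frequency bins whose inter-channel phase difference matches
    the phase difference expected for a source at target_angle (radians,
    0 = broadside); all other bins are attenuated to zero.
    """
    L, R = stft(left, n_fft), stft(right, n_fft)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    # Observed inter-channel phase difference per time-frequency bin.
    ipd = np.angle(L * np.conj(R))
    # Expected phase difference for the prior (target) angle of arrival.
    expected = 2 * np.pi * freqs * mic_dist * np.sin(target_angle) / c
    # Wrapped deviation from the expectation; small deviation -> keep the bin.
    diff = np.angle(np.exp(1j * (ipd - expected)))
    mask = (np.abs(diff) < tol).astype(float)
    return mask * L  # masked spectrogram of the reference (left) channel
```

In practice such a mask would be refined (e.g. soft masking, spatial-aliasing handling above c/(2d) Hz, and smoothing over time), but the hard 0/1 decision per bin is what makes it a binary mask.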
