The Munich 2011 CHiME Challenge Contribution: NMF-BLSTM Speech Enhancement and Recognition for Reverberated Multisource Environments

We present the Munich contribution to the PASCAL ‘CHiME’ Speech Separation and Recognition Challenge: Our approach combines source separation by supervised convolutive non-negative matrix factorisation (NMF) with our tandem recogniser that augments acoustic features by word predictions of a Long Short-Term Memory recurrent neural network in a multi-stream Hidden Markov Model. The performance of our source separation approach is demonstrated in a sequence of gradually refined speech recognisers. While NMF drastically improves performance for all investigated recognisers, best results are obtained with the multi-stream approach along with a novel adaptation technique for noise dictionaries in supervised NMF. On thefinal Challenge test set, the proposed system delivers an average keyword recognition accuracy of 87.86% across SNRs ranging from -6 to 9dB, reducing the error rate from 44% to 12% compared to the Challenge baseline. Index Terms: Non-Negative Matrix Factorisation, Tandem Speech Recognition

[1]  Tuomas Virtanen,et al.  Separation of drums from polyphonic music using non-negative matrix factorization and support vector machine , 2005, 2005 13th European Signal Processing Conference.

[2]  Ning Ma,et al.  The CHiME corpus: a resource and a challenge for computational hearing in multisource environments , 2010, INTERSPEECH.

[3]  Paris Smaragdis,et al.  Convolutive Speech Bases and Their Application to Supervised Speech Separation , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[4]  Mikkel N. Schmidt,et al.  Single-channel speech separation using sparse non-negative matrix factorization , 2006, INTERSPEECH.

[5]  Bhiksha Raj,et al.  Non-negative matrix factorization based compensation of music for automatic speech recognition , 2010, INTERSPEECH.

[6]  Björn W. Schuller,et al.  OpenBliSSART: Design and evaluation of a research toolkit for Blind Source Separation in Audio Recognition Tasks , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Emmanuel Vincent,et al.  Sound Source Separation , 2011 .

[8]  Paris Smaragdis,et al.  Mitsubishi Electric Research Laboratories , 1994 .

[9]  Tuomas Virtanen,et al.  Artificial and online acquired noise dictionaries for noise robust ASR , 2010, INTERSPEECH.

[10]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[11]  Bhiksha Raj,et al.  Speech denoising using nonnegative matrix factorization with priors , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12]  Björn W. Schuller,et al.  Combining Long Short-Term Memory and Dynamic Bayesian Networks for Incremental Emotion-Sensitive Artificial Listening , 2010, IEEE Journal of Selected Topics in Signal Processing.

[13]  Björn W. Schuller,et al.  A multi-stream ASR framework for BLSTM modeling of conversational speech , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Björn W. Schuller,et al.  Bidirectional LSTM Networks for Context-Sensitive Keyword Detection in a Cognitive Virtual Agent Framework , 2010, Cognitive Computation.

[15]  Andrzej Cichocki,et al.  A Multiplicative Algorithm for Convolutive Non-Negative Matrix Factorization Based on Squared Euclidean Distance , 2009, IEEE Transactions on Signal Processing.

[16]  Jürgen Schmidhuber,et al.  Framewise phoneme classification with bidirectional LSTM and other neural network architectures , 2005, Neural Networks.

[17]  Björn W. Schuller,et al.  Robust in-car spelling recognition - a tandem BLSTM-HMM approach , 2009, INTERSPEECH.

[18]  John R. Hershey,et al.  Monaural speech separation and recognition challenge , 2010, Comput. Speech Lang..