Paper Information - The Munich 2011 CHiME Challenge Contribution: NMF-BLSTM Speech Enhancement and Recognition for Reverberated Multisource Environments

We present the Munich contribution to the PASCAL ‘CHiME’ Speech Separation and Recognition Challenge. Our approach combines source separation by supervised convolutive non-negative matrix factorisation (NMF) with our tandem recogniser, which augments acoustic features with word predictions of a Long Short-Term Memory recurrent neural network in a multi-stream Hidden Markov Model. The performance of our source separation approach is demonstrated in a sequence of gradually refined speech recognisers. While NMF drastically improves performance for all investigated recognisers, the best results are obtained with the multi-stream approach along with a novel adaptation technique for noise dictionaries in supervised NMF. On the final Challenge test set, the proposed system delivers an average keyword recognition accuracy of 87.86% across SNRs ranging from -6 to 9 dB, reducing the error rate from 44% to 12% compared to the Challenge baseline.

Index Terms: Non-Negative Matrix Factorisation, Tandem Speech Recognition
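To make the idea of supervised NMF separation concrete, the following is a minimal sketch, not the authors' system. It assumes plain (non-convolutive) NMF with the standard multiplicative update for the squared Euclidean distance, pre-trained speech and noise dictionaries that stay fixed, and a soft mask applied to the mixture spectrogram. The convolutive bases and the noise dictionary adaptation mentioned in the abstract are left out, and all function and variable names are hypothetical.

```python
import numpy as np

def supervised_nmf_separate(V, W_speech, W_noise, n_iter=100, eps=1e-9, seed=0):
    """Estimate the speech part of a non-negative magnitude spectrogram.

    V        : (n_freq, n_frames) mixture spectrogram
    W_speech : (n_freq, k_s) fixed speech dictionary (assumed pre-trained)
    W_noise  : (n_freq, k_n) fixed noise dictionary (assumed pre-trained)
    Returns the masked speech estimate and the activation matrix H.
    """
    rng = np.random.default_rng(seed)
    W = np.hstack([W_speech, W_noise])            # joint dictionary, kept fixed
    H = rng.random((W.shape[1], V.shape[1])) + eps

    for _ in range(n_iter):
        # Multiplicative update for the Euclidean cost; only H is updated.
        H *= (W.T @ V) / (W.T @ W @ H + eps)

    k_s = W_speech.shape[1]
    speech_part = W_speech @ H[:k_s]              # speech reconstruction
    noise_part = W_noise @ H[k_s:]                # noise reconstruction

    # Soft mask: share of the modelled energy attributed to speech.
    mask = speech_part / (speech_part + noise_part + eps)
    return mask * V, H


def make_toy_mixture(seed=1):
    """Hypothetical toy data: speech and noise occupy disjoint frequency rows."""
    rng = np.random.default_rng(seed)
    W_s = np.zeros((8, 2))
    W_n = np.zeros((8, 2))
    W_s[:4] = rng.random((4, 2)) + 0.1            # speech lives in rows 0..3
    W_n[4:] = rng.random((4, 2)) + 0.1            # noise lives in rows 4..7
    speech = W_s @ (rng.random((2, 20)) + 0.1)
    noise = W_n @ (rng.random((2, 20)) + 0.1)
    return speech + noise, W_s, W_n, speech


V, W_s, W_n, true_speech = make_toy_mixture()
speech_estimate, H = supervised_nmf_separate(V, W_s, W_n)
```

In this toy setup the two dictionaries never overlap in frequency, so the mask recovers the speech component almost exactly. Real speech and noise overlap heavily, which is why the quality of the dictionaries matters in practice.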

Björn Schuller | Martin Wöllmer | Gerhard Rigoll | Felix Weninger | Jürgen T. Geiger
