Feature enhancement by deep LSTM networks for ASR in reverberant multisource environments

Abstract: This article investigates speech feature enhancement based on deep bidirectional recurrent neural networks. The Long Short-Term Memory (LSTM) architecture is used to exploit a self-learnt amount of temporal context in learning the correspondences between noisy, reverberant speech features and their undistorted counterparts. The resulting networks are applied to feature enhancement on track 2 of the 2013 2nd Computational Hearing in Multisource Environments (CHiME) Challenge, a task consisting of the Wall Street Journal (WSJ-0) corpus distorted by highly non-stationary, convolutive noise. In extensive test runs, different feature front-ends, network training targets, and network topologies are evaluated in terms of frame-wise regression error and speech recognition performance. Furthermore, we consider gradually refined speech recognition back-ends, from baseline 'out-of-the-box' clean models to discriminatively trained multi-condition models adapted to the enhanced features. Overall, deep bidirectional LSTM networks processing log Mel filterbank outputs deliver the best results with clean models, reaching a word error rate (WER) of 42% at signal-to-noise ratios ranging from −6 to 9 dB (multi-condition CHiME Challenge baseline: 55% WER). Discriminative training of the back-end using LSTM-enhanced features is shown to further decrease the WER to 22%. To our knowledge, this is the best result reported to date for the 2nd CHiME Challenge WSJ-0 task.
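To make the enhancement scheme concrete, the following is a minimal sketch of the core idea the abstract describes: a deep bidirectional LSTM that maps noisy/reverberant log Mel filterbank frames to clean-speech targets via frame-wise regression. It uses PyTorch and is not the authors' original implementation or toolchain; the layer count, hidden size, number of Mel bands, and the Adam optimizer are illustrative assumptions, not values from the paper.

```python
# Sketch of deep bidirectional LSTM feature enhancement: noisy log Mel
# frames in, estimates of the corresponding clean frames out. All
# hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn

class BLSTMEnhancer(nn.Module):
    def __init__(self, n_mel=26, hidden=128, layers=3):
        super().__init__()
        # Deep bidirectional LSTM: each frame's output depends on a
        # self-learnt amount of past and future temporal context.
        self.blstm = nn.LSTM(input_size=n_mel, hidden_size=hidden,
                             num_layers=layers, bidirectional=True,
                             batch_first=True)
        # Linear layer regresses the enhanced features frame by frame
        # (2 * hidden because forward and backward states are concatenated).
        self.out = nn.Linear(2 * hidden, n_mel)

    def forward(self, noisy):                # noisy: (batch, frames, n_mel)
        h, _ = self.blstm(noisy)
        return self.out(h)                   # enhanced: (batch, frames, n_mel)

# Training minimizes the frame-wise regression error (here, MSE) between
# the network output and parallel undistorted ("clean") features.
model = BLSTMEnhancer()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

noisy = torch.randn(4, 200, 26)   # dummy batch: 4 utterances x 200 frames
clean = torch.randn(4, 200, 26)   # synthetic stand-in for clean targets

optimizer.zero_grad()
loss = criterion(model(noisy), clean)
loss.backward()
optimizer.step()
```

At test time, the enhanced feature sequences would be passed to the recognition back-end in place of the noisy features; the paper's best results additionally retrain the back-end discriminatively on such enhanced features.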
