Channel mapping using bidirectional long short-term memory for dereverberation in hands-free voice-controlled devices

In this article, the reverberation problem for hands-free voice-controlled devices is addressed by employing Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks. Such networks use memory blocks in their hidden units, enabling them to exploit a self-learnt amount of temporal context. The main objective of this technique is to minimize the mismatch between distant-talk (reverberant/distorted) speech and close-talk (clean) speech. To achieve this, the network is trained to map the cepstral feature space of the distant-talk channel to its counterpart from the close-talk channel, frame by frame, as a regression task. The method has been successfully evaluated on a realistically recorded reverberant French corpus through a large-scale set of experiments that compare a variety of network architectures, investigate different network training targets (differential or absolute), and combine the approach with common adaptation techniques. In addition, the robustness of the technique is assessed by cross-room evaluation on both a simulated French corpus and a realistic English corpus. Experimental results show that the proposed BLSTM dereverberation models trained with differential targets reduce the word error rate (WER) by 16% relative on the French corpus (intra-room scenario) and by 8% relative on the English corpus (inter-room scenario).
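
To make the described mapping concrete, the sketch below shows one plausible way to set up frame-wise BLSTM channel mapping with a switchable differential target. It is a minimal illustration, not the authors' implementation: PyTorch, the class name `BLSTMChannelMapper`, the 39-dimensional cepstral features (e.g. 13 MFCCs plus deltas and double-deltas), and all hyperparameters are assumptions made for this example.

```python
# Hypothetical sketch of BLSTM channel mapping (not the paper's actual code).
import torch
import torch.nn as nn

class BLSTMChannelMapper(nn.Module):
    def __init__(self, feat_dim=39, hidden=128, layers=2, differential=True):
        super().__init__()
        # bidirectional=True lets the network exploit both past and future
        # temporal context, as the abstract describes.
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, feat_dim)
        self.differential = differential

    def forward(self, distant):
        # distant: (batch, frames, feat_dim) reverberant cepstral features
        out, _ = self.blstm(distant)
        mapped = self.proj(out)
        # Differential target: predict the clean-minus-reverberant residual
        # and add it back; absolute target: predict clean features directly.
        return distant + mapped if self.differential else mapped

# Frame-wise regression toward the close-talk (clean) features.
model = BLSTMChannelMapper()
mse = nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

distant = torch.randn(8, 200, 39)  # stand-in for distant-talk cepstra
close = torch.randn(8, 200, 39)    # stand-in for aligned close-talk cepstra
loss = mse(model(distant), close)
opt.zero_grad(); loss.backward(); opt.step()
```

Under the differential setup, the network only has to learn the channel distortion between the two feature streams rather than re-synthesize the clean features from scratch, which is one reading of why the differential targets perform better in the reported experiments.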
