DNN-Based Feature Enhancement Using DOA-Constrained ICA for Robust Speech Recognition

The performance of automatic speech recognition (ASR) systems is often degraded in adverse real-world environments. Deep learning has recently emerged as a breakthrough for acoustic modeling in ASR; accordingly, deep neural network (DNN)-based speech feature enhancement (FE) approaches have attracted much attention owing to their powerful modeling capabilities. However, DNN-based approaches are unable to achieve substantial performance improvements for severely distorted speech when the test environments differ from the training environments. In this letter, we propose a DNN-based FE method in which the DNN inputs include preenhanced spectral features computed from multichannel input signals, so that noise-robust features can be reconstructed. The preenhanced spectral features are obtained by direction-of-arrival (DOA)-constrained independent component analysis (DCICA) followed by Bayesian FE using a hidden-Markov-model prior, which exploits both efficient online target speech extraction and efficient FE with prior information for robust ASR. In addition, noise spectral features computed from DCICA are included for further improvement. The DNN is thus trained to reconstruct a clean spectral feature vector from a sequence of corrupted input feature vectors together with the corresponding preenhanced and noise feature vectors. Experimental results demonstrate that the proposed method significantly improves recognition performance, even in mismatched noise conditions.
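The input construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature dimension (24), the context width (5 frames on each side), the edge padding, and the choice to append only the current-frame preenhanced and noise vectors are all assumptions made for illustration, and the function names `splice` and `build_dnn_input` are hypothetical. Random arrays stand in for the corrupted, DCICA/Bayesian-FE preenhanced, and DCICA noise features.

```python
import numpy as np

def splice(feats, context):
    """Stack each frame with `context` neighbours on either side.

    Edge frames are repeated as padding (an assumption, not from the paper).
    feats: (num_frames, dim) -> (num_frames, (2 * context + 1) * dim)
    """
    num_frames = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i:i + num_frames] for i in range(2 * context + 1)], axis=1
    )

def build_dnn_input(noisy, preenhanced, noise, context=5):
    """Concatenate a spliced window of corrupted features with the
    current-frame preenhanced and noise features (hypothetical layout)."""
    return np.concatenate([splice(noisy, context), preenhanced, noise], axis=1)

# Placeholder data standing in for real spectral features.
rng = np.random.default_rng(0)
num_frames, dim = 100, 24
noisy = rng.standard_normal((num_frames, dim))
preenhanced = rng.standard_normal((num_frames, dim))
noise = rng.standard_normal((num_frames, dim))

dnn_input = build_dnn_input(noisy, preenhanced, noise, context=5)
print(dnn_input.shape)  # (100, 312): 11 * 24 spliced + 24 + 24
```

Each row of `dnn_input` would be paired with the clean spectral feature vector of the same frame as the regression target during training.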
