Robust front-end processing for Speech Recognition in noisy conditions

In this paper, we investigate the applicability and effectiveness of advanced feature compensation techniques for building a robust front-end for Automatic Speech Recognition (ASR). First, we modify the Vector Taylor Series (VTS) equations by incorporating an auditory masking factor. The resulting VTS approximation is used to compensate the parameters of a clean speech model, and a Minimum Mean Square Error (MMSE) estimator then recovers the clean speech features from the noisy features. Second, we apply root compression instead of the conventional log compression to the mel filter-bank energies. Third, we apply a frame selection method that discards noise-dominated frames to improve performance in high-noise scenarios. The proposed algorithms are validated on noise-corrupted versions of the Librispeech and TIMIT speech recognition databases and are shown to provide significant gains in performance.
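To make the second and third steps concrete, the following is a minimal Python sketch of root compression and of one possible frame selection rule. It is not the implementation evaluated in the paper. The function names `compress` and `select_frames`, the root exponent `gamma`, the per-frame noise energy estimate `noise_energy`, and the SNR threshold are all assumptions introduced for illustration; the abstract does not specify the exponent or the selection criterion.

```python
import math


def compress(energies, mode="root", gamma=0.1, floor=1e-10):
    """Compress the mel filter-bank energies of one frame.

    gamma is an assumed root exponent; the abstract gives no value.
    floor guards the logarithm against zero energies.
    """
    if mode == "log":
        return [math.log(max(e, floor)) for e in energies]
    if mode == "root":
        return [max(e, 0.0) ** gamma for e in energies]
    raise ValueError("mode must be 'log' or 'root'")


def select_frames(frames, noise_energy, snr_threshold_db=0.0):
    """Return the indices of frames that are not noise dominated.

    Assumed criterion (not taken from the paper): estimate the speech
    energy of a frame by subtracting a noise energy estimate from the
    total filter-bank energy, and keep the frame when the resulting
    SNR estimate exceeds snr_threshold_db.
    """
    if noise_energy <= 0.0:
        raise ValueError("noise_energy must be positive")
    kept = []
    for index, frame in enumerate(frames):
        total = sum(frame)
        speech = max(total - noise_energy, 1e-10)  # avoid log of zero
        snr_db = 10.0 * math.log10(speech / noise_energy)
        if snr_db > snr_threshold_db:
            kept.append(index)
    return kept


if __name__ == "__main__":
    frames = [[1.0, 1.0], [50.0, 50.0]]  # toy filter-bank energies
    kept = select_frames(frames, noise_energy=2.0)
    features = [compress(frames[i], mode="root") for i in kept]
    print(kept, features)
```

In this sketch, root compression maps a zero energy to zero, whereas the log needs a floor to stay finite, so low-energy bins vary far less under the root than under the log. The SNR-based rule is only one plausible way to identify noise-dominated frames.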
