Deep Neural Network-Based Noise Estimation for Robust ASR in Dual-Microphone Smartphones

The performance of many noise-robust automatic speech recognition (ASR) methods, such as vector Taylor series (VTS) feature compensation, depends heavily on an estimate of the noise that contaminates the speech. Providing accurate noise estimates for such methods is therefore both crucial and challenging. In this paper, we investigate the use of deep neural networks (DNNs) to perform noise estimation on dual-microphone smartphones. Thanks to the powerful regression capabilities of DNNs, accurate noise estimates can be obtained from simple features together with the power level difference (PLD) between the two microphones of the smartphone when it is used in close-talk conditions. This is confirmed by our word recognition results on the AURORA2-2C (AURORA2 - 2 Channels - Conversational Position) database: combined with a VTS feature compensation method, the proposed approach largely outperforms state-of-the-art single- and dual-channel noise estimation algorithms.
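As a rough illustration of the PLD cue mentioned above, the following minimal sketch computes a per-frame, per-frequency-bin log-power difference between two microphone signals. It is not the paper's feature extraction: the frame length, hop size, Hann window, dB scaling, function names, and the synthetic test signals are all assumptions made for illustration only.

```python
import numpy as np

def frame_power_spectrum(signal, frame_len=256, hop=128):
    """Short-time power spectrum, shape (n_frames, frame_len // 2 + 1).

    Frame length, hop and window are illustrative assumptions.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def log_power_level_difference(primary, secondary, eps=1e-10):
    """Log-power difference (in dB) between the two channels.

    Hypothetical helper: one possible way to express a PLD feature.
    """
    p1 = frame_power_spectrum(primary)
    p2 = frame_power_spectrum(secondary)
    return 10.0 * np.log10(p1 + eps) - 10.0 * np.log10(p2 + eps)

# Toy close-talk scenario (synthetic, assumed): the target signal is strong
# at the primary microphone and attenuated at the secondary one, while the
# noise reaches both microphones at a similar level.
rng = np.random.default_rng(0)
speech = rng.standard_normal(4000)
noise = rng.standard_normal(4000)
primary = speech + 0.1 * noise
secondary = 0.2 * speech + 0.1 * noise

pld = log_power_level_difference(primary, secondary)
noise_only_pld = log_power_level_difference(noise, noise)
```

In this toy setup, bins dominated by the close-talk source show a clearly positive difference, whereas a signal that arrives equally at both microphones gives a difference of zero. A regression model such as a DNN could take features of this kind as input, but the exact inputs and targets used in the paper are not specified in this abstract.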
