ASR-driven top-down binary mask estimation using spectral priors

Typical mask estimation algorithms use low-level features to estimate the interfering noise or the instantaneous SNR. We propose a simple top-down approach to mask estimation: the estimated mask is based on a specific hypothesis of the underlying speech, without using any information about the interference or the instantaneous SNR. In this pilot study, we observe a 9% reduction in word error rate over a baseline recognition system on the Aurora4 corpus, and much greater gains could in principle be achieved by improving the model selection process. We also present SNR improvement results showing that our method performs as well as a standard MMSE-based method, demonstrating that speech recognition can aid speech enhancement. The relationship between recognition and enhancement therefore need not be one-way: linguistic information can play a significant role in speech enhancement.
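To make the top-down idea concrete, the following is a minimal sketch (not the paper's exact algorithm) of binary mask estimation driven by a hypothesized clean-speech power spectrum. The function name, the SNR threshold, and the toy spectra are illustrative assumptions; the point is that the mask is computed from a speech hypothesis, with the noise only implied as the residual, rather than estimated bottom-up from the observation.

```python
import numpy as np

def estimate_binary_mask(noisy_power, hypothesized_clean_power, snr_threshold_db=0.0):
    """Illustrative top-down binary mask.

    Keep a time-frequency cell when the hypothesized clean-speech energy
    dominates the residual (implied noise) energy by the given local SNR
    threshold in dB. No explicit noise model is used; the noise estimate
    falls out of the speech hypothesis.
    """
    eps = 1e-12
    # Implied noise: whatever energy the speech hypothesis does not explain.
    implied_noise_power = np.maximum(noisy_power - hypothesized_clean_power, eps)
    local_snr_db = 10.0 * np.log10(hypothesized_clean_power / implied_noise_power + eps)
    return (local_snr_db > snr_threshold_db).astype(float)

# Toy example: 2 frequency bins x 3 frames of power-spectrum values.
noisy = np.array([[4.0, 1.0, 2.0],
                  [1.0, 3.0, 1.0]])
# Stand-in for the ASR-driven spectral prior of the hypothesized speech.
clean_hyp = np.array([[3.5, 0.2, 1.5],
                      [0.2, 2.5, 0.3]])
mask = estimate_binary_mask(noisy, clean_hyp)
# Cells where the speech hypothesis dominates are kept (1), others dropped (0).
```

In a full system, `clean_hyp` would come from the recognizer's current word hypothesis mapped through spectral priors, and the resulting mask would feed either missing-feature recognition or enhancement by spectrogram masking.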
