Reduction of Highly Nonstationary Ambient Noise by Integrating Spectral and Locational Characteristics of Speech and Noise for Robust ASR

This paper proposes a new multi-channel noise reduction approach that handles highly nonstationary noise based on the spectral and locational features of speech and noise. We focus on a distant-talking scenario, in which a 2-ch microphone array receives a target speaker's voice from the front while receiving highly nonstationary ambient noise from any direction. To cope with this scenario, we introduce prior training not only for the spectral features of speech and noise but also for their locational features, and use the two kinds of features in a unified manner. The proposed method distinguishes rapid changes in speech and noise mainly from their locational features, while reliably estimating the spectral shapes of the speech largely from the spectral features. A filter-bank based implementation is also discussed to enable the proposed method to work in real time. Experiments using the PASCAL CHiME separation and recognition challenge task show the superiority of the proposed method in terms of both speech quality and automatic speech recognition performance.
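To make the idea concrete, the following is a minimal sketch of the general mask-based scheme the abstract describes: a locational cue (the inter-channel phase difference of a 2-ch array, near zero for a frontal source) is fused with a spectral cue into a soft time-frequency mask. This is not the authors' algorithm; the trained spectral and locational models are replaced by toy stand-ins (a phase-difference gate and a per-band energy threshold), and all signals, window settings, and thresholds are illustrative assumptions.

```python
# Minimal sketch, NOT the paper's method: fuse a locational cue and a
# spectral cue into a soft time-frequency mask for 2-ch noise reduction.
# Signals, thresholds, and the toy models below are illustrative assumptions.
import numpy as np

def stft(x, frame=512, hop=256):
    # Hann-windowed short-time Fourier transform, 50% overlap.
    win = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    return np.stack([np.fft.rfft(win * x[i*hop:i*hop+frame]) for i in range(n)])

def istft(X, frame=512, hop=256):
    # Plain overlap-add inverse; with Hann analysis at 50% overlap this
    # gives an approximate (good enough for a sketch) reconstruction.
    out = np.zeros(hop * (len(X) - 1) + frame)
    for i, spec in enumerate(X):
        out[i*hop:i*hop+frame] += np.fft.irfft(spec, frame)
    return out

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)        # stand-in "speech" from the front
noise = rng.normal(scale=0.5, size=fs)      # stand-in ambient noise
delay = 4                                   # inter-channel delay (samples) for off-axis noise
ch1 = target + noise
ch2 = target + np.roll(noise, delay)        # frontal target aligned, noise delayed on ch2

X1, X2 = stft(ch1), stft(ch2)

# Locational feature: inter-channel phase difference. A frontal source
# reaches both microphones simultaneously, so its phase difference is ~0.
ipd = np.angle(X1 * np.conj(X2))
loc_mask = (np.abs(ipd) < 0.5).astype(float)            # assumed tolerance (rad)

# Spectral feature: crude stand-in for a trained speech spectral model --
# keep bins whose magnitude exceeds a per-band noise-floor estimate.
noise_floor = np.median(np.abs(X1), axis=0, keepdims=True)
spec_mask = (np.abs(X1) > 2.0 * noise_floor).astype(float)  # assumed factor

mask = 0.5 * (loc_mask + spec_mask)   # naive average of the two cues
enhanced = istft(mask * X1)           # masked reference channel
```

The actual method integrates trained spectral and locational models probabilistically rather than averaging two hard gates, and uses a filter-bank implementation for real-time operation; the sketch only illustrates why the two cues are complementary (the locational gate tracks rapid noise changes, the spectral gate preserves speech spectral shape).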
