A phase-averaged model for the relationship between noisy speech, clean speech and noise in the log-Mel domain

In this work, we demonstrate that the most widely used model for the relationship between noisy speech, clean speech and noise in the log-Mel domain is inaccurate because it disregards the phase. Moreover, we show how a more exact model can be derived by averaging over the phase in the log-Mel domain, and how this model can be applied profitably to particle-filter-based sequential noise compensation. Experimental results confirm the superiority of the phase-averaged model, both for clean speech estimation in general and for the particle filter in particular. Relative reductions in word error rate of up to 17% were obtained on a large-vocabulary task.
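
The abstract does not state the equations, so the following is only a minimal sketch of why disregarding the phase introduces a bias. It assumes a single frequency bin with a uniformly distributed relative phase between speech and noise, which is simpler than the log-Mel setting of the paper, where each Mel filter sums over several bins. The function names and the values of `x` and `n` are hypothetical and chosen for illustration only.

```python
import numpy as np

def phase_ignoring_model(x, n):
    """Log power of noisy speech when the cross term is dropped:
    y = log(exp(x) + exp(n))."""
    return np.logaddexp(x, n)

def phase_sensitive_model(x, n, theta):
    """Single-bin relation that keeps the cross term (illustrative assumption):
    y = log(exp(x) + exp(n) + 2 * exp((x + n) / 2) * cos(theta))."""
    cross = 2.0 * np.exp(0.5 * (x + n)) * np.cos(theta)
    return np.log(np.exp(x) + np.exp(n) + cross)

# Hypothetical log powers of clean speech (x) and noise (n).
x, n = 0.0, -1.0

# Average the phase-sensitive log power over a uniform grid of relative phases.
theta = np.linspace(-np.pi, np.pi, 100_000, endpoint=False)
y_phase_averaged = phase_sensitive_model(x, n, theta).mean()

# Prediction of the model that ignores the phase.
y_phase_ignoring = phase_ignoring_model(x, n)

print("phase-averaged log power:", y_phase_averaged)
print("phase-ignoring log power:", y_phase_ignoring)
```

Under these single-bin assumptions the phase average of the log power equals `max(x, n)`, so the phase-ignoring prediction lies above it. The size of this gap in the actual log-Mel domain depends on the distribution of the phase term after Mel filtering, which this sketch does not model.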
