Extended VTS for Noise-Robust Speech Recognition

Model compensation is a standard way of improving the robustness of speech recognition systems to noise. A number of popular schemes are based on vector Taylor series (VTS) compensation, which uses a linear approximation to represent the influence of noise on the clean speech. To compensate the dynamic parameters, the continuous time approximation is often used. This approximation uses a point estimate of the gradient, which fails to take into account that dynamic coefficients are a function of a number of consecutive static coefficients. In this paper, the accuracy of dynamic parameter compensation is improved by representing the dynamic features as a linear transformation of a window of static features. A modified version of VTS compensation is applied to the distribution of the window of static features and, importantly, their correlations. These compensated distributions are then transformed to distributions over standard static and dynamic features. With this improved approximation, it is also possible to obtain full-covariance corrupted speech distributions. This addresses the correlation changes that occur in noise. The proposed scheme outperformed the standard VTS scheme by 10% to 20% relative on a range of tasks.
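The central idea, that dynamic features are a linear transformation of a window of static features, can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes the common regression formula for delta coefficients over a window of 2K+1 frames, and the names `delta_projection` and `project_gaussian` are hypothetical. Under those assumptions, a Gaussian over the stacked window (mean and full covariance, for example after compensation) maps exactly to a Gaussian over static and delta features.

```python
import numpy as np

def delta_projection(dim, K=2):
    """Build D so that D @ window = [static; delta].

    `window` stacks 2K+1 consecutive static frames (t-K .. t+K), each of size `dim`.
    Assumption: deltas use the regression weights w_k = k / (2 * sum_{j=1..K} j^2).
    """
    n_frames = 2 * K + 1
    norm = 2.0 * sum(j * j for j in range(1, K + 1))
    weights = np.array([k / norm for k in range(-K, K + 1)])
    centre = np.zeros(n_frames)
    centre[K] = 1.0                      # picks out the frame at time t
    eye = np.eye(dim)
    static_rows = np.kron(centre[None, :], eye)   # shape (dim, n_frames * dim)
    delta_rows = np.kron(weights[None, :], eye)   # shape (dim, n_frames * dim)
    return np.vstack([static_rows, delta_rows])

def project_gaussian(mean_w, cov_w, D):
    """Map a Gaussian over the window to one over [static; delta].

    A linear map of a Gaussian is Gaussian: mean D m, covariance D S D^T.
    """
    return D @ mean_w, D @ cov_w @ D.T

# Usage: a linear ramp x_t = t * slope has static value 0 at t = 0 and delta = slope.
dim, K = 2, 2
slope = np.array([1.0, 2.0])
mean_w = np.concatenate([t * slope for t in range(-K, K + 1)])
cov_w = np.eye((2 * K + 1) * dim)        # toy window covariance, assumed for illustration
D = delta_projection(dim, K)
mean_sd, cov_sd = project_gaussian(mean_w, cov_w, D)
```

Because the window covariance is carried through `D`, correlations between frames contribute to the delta variances and to the static-delta cross terms, which a point estimate of the gradient cannot capture.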
