Improving joint uncertainty decoding performance by predictive methods for noise robust speech recognition

Model-based noise compensation techniques, such as Vector Taylor Series (VTS) compensation, have been applied to a range of noise robustness tasks. However one of the issues with these forms of approach is that for large speech recognition systems they are computationally expensive. To address this problem schemes such as Joint Uncertainty Decoding (JUD) have been proposed. Though computationally more efficient, the performance of the system is typically degraded. This paper proposes an alternative scheme, related to JUD, but making fewer approximations, VTS-JUD. Unfortunately this approach also removes some of the computational advantages of JUD. To address this, rather than using VTS-JUD directly, it is used instead to obtain statistics to estimate a predictive linear transform, PCMLLR. This is both computationally efficient and limits some of the issues associated with the diagonal covariance matrices typically used with schemes such as VTS. PCMLLR can also be simply used within an adaptive training framework (PAT). The performance of the VTS-JUD, PCMLLR and PAT system were compared to a number of standard approaches on an in-car speech recognition task. The proposed scheme is an attractive alternative to existing approaches.

[1]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[2]  Richard M. Schwartz,et al.  A compact model for speaker-adaptive training , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[3]  Alex Acero,et al.  Noise adaptive training using a vector taylor series approach for noise robust automatic speech recognition , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  Li Deng,et al.  HMM adaptation using vector taylor series for noisy speech recognition , 2000, INTERSPEECH.

[5]  Li Deng,et al.  Large-vocabulary speech recognition under adverse acoustic environments , 2000, INTERSPEECH.

[6]  Mark J. F. Gales,et al.  Adaptive Training with Joint Uncertainty Decoding for Robust Recognition of Noisy Data , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[7]  Mark J. F. Gales,et al.  Extended VTS for Noise-Robust Speech Recognition , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[8]  Mark J. F. Gales,et al.  Incremental predictive and adaptive noise compensation , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[9]  Yu Hu,et al.  Irrelevant variability normalization based HMM training using VTS approximation of an explicit model of environmental distortions , 2007, INTERSPEECH.

[10]  Mark J. F. Gales,et al.  Joint uncertainty decoding for noise robust speech recognition , 2005, INTERSPEECH.

[11]  Mark J. F. Gales,et al.  Predictive linear transforms for noise robust speech recognition , 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[12]  Yifan Gong,et al.  High-performance hmm adaptation with joint compensation of additive and convolutive distortions via Vector Taylor Series , 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).