Robust speech recognition through selection of speaker and environment transforms

In this paper, we address robustness to both noise and speaker variability in automatic speech recognition (ASR). We propose the use of pre-computed noise and speaker transforms; an optimal combination of these two transforms is chosen at test time using a maximum-likelihood (ML) criterion. The transforms are pre-computed during training from data covering the noise conditions typically encountered in the given ASR task. The environment transforms are estimated using the constrained MLLR (CMLLR) framework, while for the speaker transforms we use analytically determined linear-VTLN matrices. Even though the exact noise environment may not be encountered at test time, the ML-based choice of the closest environment transform provides “sufficient” cleaning; this is corroborated by experimental results on the Aurora-2 task, with performance comparable to histogram-equalization and Vector Taylor Series approaches. The proposed method is simple, since it involves only the selection of pre-computed environment and speaker transforms, and can therefore be applied with very little test data, unlike many other speaker- and noise-compensation methods.
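The selection procedure described above can be sketched as an exhaustive ML search over the pre-computed transform pairs. The sketch below is illustrative only: a single diagonal-covariance Gaussian stands in for the HMM/GMM acoustic model, the transform application order is an assumption, and all function and variable names (`select_transforms`, `env_transforms`, `vtln_matrices`) are hypothetical, not from the paper.

```python
# Hypothetical sketch of ML-based selection of pre-computed transforms.
# Each environment transform is a CMLLR-style affine pair (A, b); each
# speaker transform is a linear-VTLN matrix W. The (environment, speaker)
# pair maximizing the likelihood of the test features is chosen.
import numpy as np

def log_likelihood(feats, mean, var):
    """Total log-likelihood of feature frames under a diagonal Gaussian
    (a stand-in for the full acoustic model)."""
    diff = feats - mean
    per_frame = -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=1)
    return per_frame.sum()

def select_transforms(feats, env_transforms, vtln_matrices, mean, var):
    """Return the ML-best (environment index, speaker index) pair."""
    best, best_ll = None, -np.inf
    for i, (A, b) in enumerate(env_transforms):   # CMLLR-style: x' = A x + b
        for j, W in enumerate(vtln_matrices):     # linear-VTLN matrix (assumed order)
            transformed = (feats @ A.T + b) @ W.T
            ll = log_likelihood(transformed, mean, var)
            if ll > best_ll:
                best, best_ll = (i, j), ll
    return best, best_ll
```

Because only pre-computed transforms are scored (no per-utterance re-estimation), the search needs just enough test data to make the likelihood comparison reliable, which matches the low-data claim in the abstract.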
