A Study on Combining VTLN and SAT to Improve the Performance of Automatic Speech Recognition

In this paper, we present ideas for combining vocal tract length normalization (VTLN) and speaker adaptive training (SAT) to improve the performance of automatic speech recognition (ASR). We show that VTLN matrices can be used as SAT transformation matrices in recognition, while training still follows conventional SAT. This is useful when there is too little adaptation data to estimate the SAT transformation matrix needed for adaptation. We also study whether VTLN can be performed after SAT and whether this combination is better than the conventional approach, in which VTLN is performed before SAT. Finally, we present a novel approach that performs VTLN by applying VTLN matrices in cascade, which allows us to include warping factors that are not in the initial search space. Recognition experiments show that these combinations improve ASR performance, with major gains when the training and test speakers are mismatched.
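As a rough illustration of why cascading can reach warping factors outside the initial search space, the sketch below makes two simplifying assumptions that are not taken from the paper: each warp is treated as a pure linear frequency scaling, so two cascaded warps with factors a and b act like a single warp with factor a·b, and the search grid runs from 0.80 to 1.20 in steps of 0.02. The function names `warp_grid` and `cascaded_factors` are hypothetical, and actual VTLN matrices on MFCC features may compose only approximately in this way.

```python
from itertools import product


def warp_grid(lo=0.80, hi=1.20, step=0.02):
    """Build the initial warping-factor search grid (assumed range and step)."""
    n = round((hi - lo) / step)
    return [round(lo + i * step, 2) for i in range(n + 1)]


def cascaded_factors(grid):
    """Return effective factors reachable by cascading two warps from the grid
    that are not already on the grid.

    Assumption: warps are pure linear frequency scalings, so cascading
    factors a and b is equivalent to a single warp with factor a * b.
    """
    single = set(grid)
    products = {round(a * b, 4) for a, b in product(grid, repeat=2)}
    return sorted(products - single)


grid = warp_grid()
extra = cascaded_factors(grid)
print(f"grid size: {len(grid)}, range: {grid[0]} to {grid[-1]}")
print(f"new effective factors: {len(extra)}, range: {extra[0]} to {extra[-1]}")
```

Under these assumptions the cascade both widens the range of effective warping factors and fills in values between the original grid points, without enlarging the set of matrices that has to be stored.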
