Paper information - Using VTLN matrices for rapid and computationally-efficient speaker adaptation with robustness to first-pass transcription errors

In this paper, we propose to combine the rapid adaptation capability of conventional Vocal Tract Length Normalization (VTLN) with the computational efficiency of transform-based adaptation such as MLLR or CMLLR. VTLN requires the estimation of only one parameter and is therefore best suited to cases where there is little adaptation data (i.e., rapid adaptation). In contrast, transform-based adaptation methods require the estimation of matrices. The drawback of conventional VTLN, however, is that it is computationally expensive, since it requires multiple spectral warpings to generate the VTLN-warped features. We have recently shown that VTLN warping can be implemented by a linear transformation (LT) of the conventional MFCC features. These LTs are analytically pre-computed and stored. In this LT-VTLN framework, the computational complexity of VTLN is similar to that of transform-based adaptation, since warp-factor estimation can be done using the same sufficient statistics as those used in CMLLR. We show that VTLN provides a significant improvement in performance over transform-based adaptation methods when there is little adaptation data. We also show that the use of an additional decorrelating transform, MLLT, along with the VTLN matrices gives performance that is better than MLLR and comparable to SAT with MLLT, even for large amounts of adaptation data. Further, we show that in the mismatched train and test case (i.e., poor first-pass transcription), VTLN provides a significant improvement over the transform-based adaptation methods. We compare the performance of the different methods on the WSJ, RM, and TIDIGITS databases.

Index Terms: VTLN, Rapid Adaptation, MLLT, CAT, Linear Transform
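As a rough illustration of the warp-factor search described in the abstract, the sketch below accumulates CMLLR-style sufficient statistics once and then scores each stored matrix against them, so the features are never re-warped. It is not the authors' implementation. The function names, the bias-free transform, the single-Gaussian demo model, and the `toy_matrix_bank` stand-in (simple near-identity matrices, not analytically derived VTLN matrices) are all assumptions made for illustration.

```python
import numpy as np


def accumulate_stats(feats, post, means, variances):
    """One pass over the data: CMLLR-style row statistics (no bias term).

    feats: (T, d) features, post: (T, M) Gaussian posteriors,
    means / variances: (M, d) diagonal-Gaussian parameters.
    """
    d = feats.shape[1]
    beta = post.sum()                      # total occupancy
    G = np.zeros((d, d, d))                # one (d, d) matrix per output row
    k = np.zeros((d, d))                   # one d-vector per output row
    for i in range(d):
        w = post / variances[:, i]         # gamma_m(t) / sigma_{m,i}^2
        G[i] = (feats * w.sum(axis=1)[:, None]).T @ feats
        k[i] = (w @ means[:, i]) @ feats
    return beta, G, k


def auxiliary(A, beta, G, k):
    """Auxiliary function of transform A, up to an A-independent constant."""
    _, logdet = np.linalg.slogdet(A)       # log|det A| is the Jacobian term
    quad = sum(A[i] @ G[i] @ A[i] - 2.0 * (A[i] @ k[i]) for i in range(len(A)))
    return beta * logdet - 0.5 * quad


def pick_warp_factor(bank, beta, G, k):
    """Return the warp factor whose stored matrix maximises the auxiliary."""
    return max(bank, key=lambda alpha: auxiliary(bank[alpha], beta, G, k))


def toy_matrix_bank(alphas, d):
    """HYPOTHETICAL stand-in for the pre-computed VTLN matrices."""
    B = np.eye(d) + 0.5 * (np.eye(d, k=1) + np.eye(d, k=-1))
    return {a: np.eye(d) + (a - 1.0) * B for a in alphas}


# Synthetic demo (assumed setup): frames match a single Gaussian only after
# being transformed by the matrix stored for true_alpha.
rng = np.random.default_rng(0)
d, T, true_alpha = 3, 4000, 1.1
bank = toy_matrix_bank([0.8, 0.9, 1.0, 1.1, 1.2], d)
means = np.array([[1.0, -2.0, 0.5]])
variances = np.ones((1, d))
warped = means + rng.standard_normal((T, d))            # frames in model space
feats = warped @ np.linalg.inv(bank[true_alpha]).T      # observed frames
post = np.ones((T, 1))                                  # one Gaussian, so posterior is 1

beta, G, k = accumulate_stats(feats, post, means, variances)
best_alpha = pick_warp_factor(bank, beta, G, k)
```

Once the statistics are accumulated, scoring a candidate costs only a few small matrix products, regardless of how many frames were seen. That is the sense in which a warp-factor search over stored matrices can cost about as much as a transform-based method.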

Srinivasan Umesh | Achintya Kumar Sarkar | Shakti Prasad Rath
