Rapid vocal tract length normalization using maximum likelihood estimation

Recently, vocal tract length normalization (VTLN) techniques have been developed for speaker normalization in speech recognition. This paper proposes a new VTLN method, in which the vocal tract length is normalized in the cepstrum space by means of linear mapping whose parameter is derived using maximumlikelihood estimation. The computational costs of this method are much lower than that of such conventional methods as ML-VTLN, in which the parameter for mapping is selected from among several parameters. Further, the new method offers greater precision in determining parameters for individual speakers. Experimental use of the method resulted in an error reduction rate of 7.1%. A combination of the proposed method with cepstrum mean normalization (CMN) method was also examined and found to reduce the error rate even more, by 14.6%.

[1]  Richard M. Schwartz,et al.  A compact model for speaker-adaptive training , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[2]  A. W. M. van den Enden,et al.  Discrete Time Signal Processing , 1989 .

[3]  William J. Byrne,et al.  Single-pass adapted training with all-pass transforms , 1999, EUROSPEECH.

[4]  Richard M. Schwartz,et al.  Fast robust inverse transform speaker adapted training using diagonal transformations , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[5]  Alan V. Oppenheim,et al.  Discrete representation of signals , 1972 .

[6]  Reinhold Häb-Umbach,et al.  A study on speaker normalization using vocal tract normalization and speaker adaptive training , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[7]  Herbert Gish,et al.  A parametric approach to vocal tract length normalization , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[8]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[9]  Li Lee,et al.  Speaker normalization using efficient frequency warping procedures , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[10]  Philip C. Woodland,et al.  Experiments in speaker normalisation and adaptation for large vocabulary speech recognition , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[11]  S. Wegmann,et al.  Speaker normalization on conversational telephone speech , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[12]  Puming Zhan,et al.  Speaker normalization based on frequency warping , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[13]  William J. Byrne,et al.  Speaker adaptation with all-pass transforms , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[14]  Hermann Ney,et al.  Improved methods for vocal tract normalization , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[15]  Takao Watanabe,et al.  Large vocabulary word recognition based on demi-syllable hidden Markov model using small amount of training data , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[16]  B. Atal Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. , 1974, The Journal of the Acoustical Society of America.