Vocal tract length normalization using rapid maximum-likelihood estimation for speech recognition

In recent years, speaker normalization techniques that compensate for differences in vocal tract length among speakers, known as vocal tract length normalization (VTLN), have been proposed for large-vocabulary speech recognition systems based on hidden Markov models (HMMs). This paper proposes a scheme that approximates small changes in vocal tract length by a linear mapping in cepstrum space governed by a single vocal tract length parameter, and that estimates this parameter from the speaker's utterances by maximum likelihood. Compared with previous schemes, which choose among multiple vocal tract length parameters prepared in advance, the proposed method can estimate a more nearly optimal parameter for each speaker with less computation. In evaluation tests on the recognition of 5000 isolated Japanese words, the proposed scheme reduced errors by 7.1% when used alone and by 14.6% when combined with cepstrum mean normalization (CMN). © 2002 Wiley Periodicals, Inc. Syst Comp Jpn, 33(5): 30–40, 2002; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.1125
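The abstract does not give the mapping or the estimator explicitly, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes the warped cepstrum is modeled as x' = x + alpha * (B x) for a fixed matrix B (a hypothetical placeholder here), that each frame is already aligned to a Gaussian with a diagonal covariance, and that the Jacobian term of the transform is neglected. Under those assumptions the maximum-likelihood alpha has a closed form, so no search over precomputed warping factors is needed.

```python
def _matvec(B, x):
    """Return B x for a square matrix B (list of rows) and a vector x."""
    return [sum(b_ij * x_j for b_ij, x_j in zip(row, x)) for row in B]


def apply_warp(frames, alpha, B):
    """Apply the assumed linear mapping x' = x + alpha * (B x) to each frame."""
    warped = []
    for x in frames:
        d = _matvec(B, x)
        warped.append([x_i + alpha * d_i for x_i, d_i in zip(x, d)])
    return warped


def estimate_warp_alpha(frames, means, variances, B):
    """Closed-form estimate of alpha for the assumed model.

    frames, means, variances: lists of equal-length vectors, one per frame.
    means/variances are those of the Gaussian aligned to each frame
    (the alignment is assumed to be given). The Jacobian term of the
    transform is neglected, which is an assumption of this sketch.
    """
    num = 0.0
    den = 0.0
    for x, mu, var in zip(frames, means, variances):
        d = _matvec(B, x)  # direction in which the frame moves as alpha grows
        for x_i, mu_i, var_i, d_i in zip(x, mu, var, d):
            num += (mu_i - x_i) * d_i / var_i
            den += d_i * d_i / var_i
    # With no usable direction, fall back to the identity mapping.
    return num / den if den > 0.0 else 0.0


if __name__ == "__main__":
    # Hypothetical 3-dimensional example; B is an arbitrary placeholder.
    B = [[0.0, 1.0, 0.0], [-1.0, 0.0, 2.0], [0.0, -2.0, 0.0]]
    frames = [[1.0, 0.5, -0.2], [0.3, -0.7, 0.9]]
    variances = [[1.0, 2.0, 0.5], [1.5, 1.0, 1.0]]
    means = apply_warp(frames, 0.05, B)  # synthetic targets
    print(estimate_warp_alpha(frames, means, variances, B))
```

Because the objective is quadratic in alpha, the estimate costs one pass over the frames, which is the kind of saving the abstract contrasts with schemes that evaluate several candidate parameters.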
