Analysis of gender normalization using MLP and VTLN features

This paper analyzes how well multilayer perceptron (MLP) frontends perform speaker normalization. We find the phonetic context decision tree to be a useful tool for assessing the speaker normalization power of different frontends. We introduce a gender question into the training of the phonetic context decision tree and, after context clustering, count the gender-specific models. We compare these counts for the following frontends: (1) Bottle-Neck (BN) features with and without vocal tract length normalization (VTLN), (2) standard MFCC, and (3) stacking of multiple MFCC frames with linear discriminant analysis (LDA). The BN frontend reduces the number of gender questions even more than VTLN does, from which we conclude that a Bottle-Neck frontend is the more effective of the two for gender normalization. Combining VTLN and BN features reduces the number of gender-specific models further.
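The counting step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the `Node` class, the question strings, and the rule that a model counts as gender-specific when any question on its path from the root is a gender question are all assumptions made for illustration, and the toy tree is invented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """Hypothetical decision-tree node; a node without a question is a leaf (one clustered model)."""
    question: Optional[str] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def count_models(node: Node, under_gender: bool = False) -> Tuple[int, int]:
    """Return (total leaves, leaves reached through at least one gender question)."""
    if node.question is None:
        return 1, int(under_gender)
    # Assumption: gender questions are marked by a "gender" prefix in the question string.
    under = under_gender or node.question.startswith("gender")
    total_yes, gender_yes = count_models(node.yes, under)
    total_no, gender_no = count_models(node.no, under)
    return total_yes + total_no, gender_yes + gender_no


# Invented toy tree mixing phonetic-context questions and gender questions.
tree = Node(
    "left=vowel",
    yes=Node("gender=female", yes=Node(), no=Node()),
    no=Node(
        "right=nasal",
        yes=Node(),
        no=Node("gender=female", yes=Node(), no=Node()),
    ),
)

total, gendered = count_models(tree)
print(f"{gendered} of {total} models are gender-specific")
```

Under this reading, a frontend that normalizes gender well should yield a tree in which fewer models sit below a gender question, since the clustering gains little from splitting on gender.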

Florian Metze | Thomas Schaaf
