A Study on Speaker Normalized MLP Features in LVCSR

Different normalization methods are applied in recent Large Vocabulary Continuous Speech Recognition Systems (LVCSR) to reduce the influence of speaker variability on the acoustic models. In this paper we investigate the use of Vocal Tract Length Normalization (VTLN) and Speaker Adaptive Training (SAT) in Multi Layer Perceptron (MLP) feature extraction on an English task. We achieve significant improvements by each normalization method and we gain further by stacking the normalizations. Studying features transformed by Constrained Maximum Likelihood Linear Regression (CMLLR) based SAT as possible input for MLP, further experiments show that MLP could not consistently take advantage of SAT as it does in case of VTLN. Index Terms: GMM-HMM, VTLN, SAT, CMLLR, LVCSR, MLP, Dempster-Shafer

[1]  Hervé Bourlard,et al.  New entropy based combination rules in HMM/ANN multi-stream ASR , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[2]  S. Wegmann,et al.  Speaker normalization on conversational telephone speech , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[3]  Wen Wang,et al.  A comparative large scale study of MLP features for Mandarin ASR , 2010, INTERSPEECH.

[4]  Georg Heigold,et al.  The RWTH 2007 TC-STAR evaluation system for european English and Spanish , 2007, INTERSPEECH.

[5]  Hynek Hermansky,et al.  Multi-resolution RASTA filtering for TANDEM-based ASR , 2005, INTERSPEECH.

[6]  Daniel P. W. Ellis,et al.  Tandem connectionist feature extraction for conventional HMM systems , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[7]  Hermann Ney,et al.  The RWTH 2009 quaero ASR evaluation system for English and German , 2010, INTERSPEECH.

[8]  Frantisek Grézl,et al.  Optimizing bottle-neck features for lvcsr , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[9]  Fabio Valente Multi-stream speech recognition based on Dempster-Shafer combination rule , 2010, Speech Commun..

[10]  Jiri Matas,et al.  On Combining Classifiers , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[11]  Hermann Ney,et al.  Vocal tract normalization as linear transformation of MFCC , 2003, INTERSPEECH.

[12]  Andreas Stolcke,et al.  On using MLP features in LVCSR , 2004, INTERSPEECH.

[13]  Tanja Schultz,et al.  The 2010 CMU GALE speech-to-text system , 2010, INTERSPEECH.

[14]  Georg Heigold,et al.  Recent improvements of the RWTH GALE Mandarin LVCSR system , 2008, INTERSPEECH.

[15]  Florian Metze,et al.  Analysis of gender normalization using MLP and VTLN features , 2010, INTERSPEECH.

[16]  Jean-Luc Gauvain,et al.  Transcribing broadcast data using MLP features , 2008, INTERSPEECH.

[17]  Hermann Ney,et al.  Speaker adaptive modeling by vocal tract normalization , 2002, IEEE Trans. Speech Audio Process..