Combining Discriminative Feature, Transform, and Model Training for Large Vocabulary Speech Recognition

Recent developments in large vocabulary continuous speech recognition (LVCSR) have shown the effectiveness of discriminative training approaches, notably three representative techniques: discriminative Gaussian training using the minimum phone error (MPE) criterion; discriminatively trained features estimated by multilayer perceptrons (MLPs); and discriminative feature transforms such as feature-level MPE (fMPE). Although MLP features, MPE models, and fMPE transforms have each been shown to improve recognition accuracy, no previous work has applied all three in a single LVCSR system. This paper uses a state-of-the-art Mandarin recognition system as a platform to study the interaction of all three techniques. Experiments in the broadcast news and broadcast conversation domains show that the contribution of each technique is nonredundant, and that the full combination yields the best performance and generalizes well across domains.
