On using MLP features in LVCSR

One of the major research thrusts in the speech group at ICSI is the use of Multi-Layer Perceptron (MLP) based features in automatic speech recognition (ASR). This paper studies three aspects of this effort: 1) the properties of MLP features that make them useful, 2) the incorporation of MLP features together with PLP features in ASR, and 3) possible redundancy between MLP features and more conventional system refinements such as discriminative training and system combination. The paper shows that MLP transformations yield variables with regular distributions, and that taking the logarithm of these variables makes the distributions easier to model with a Gaussian-HMM. Two or more vectors of these features can easily be combined without increasing the feature dimension. Recognition results on the NIST 2001 Hub-5 evaluation set, with models trained on the Switchboard Corpus, show that MLP features can significantly improve performance in large vocabulary continuous speech recognition (LVCSR), even when discriminative training and system combination are used.
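The two feature-level ideas above, the log transform and the dimension-preserving combination, can be illustrated with a minimal sketch. It assumes the MLP outputs are softmax posteriors over a small set of classes and that streams are combined by simple frame-wise averaging; the abstract does not state the combination rule, so averaging is only an illustrative stand-in, and all function names and values here are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw MLP outputs into posteriors that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def average_posteriors(streams):
    """Combine several posterior vectors for one frame by averaging.

    Assumption: plain averaging, chosen only for illustration. The
    output has the same dimension as each input vector.
    """
    n = len(streams)
    return [sum(vals) / n for vals in zip(*streams)]

def log_posteriors(posteriors, floor=1e-10):
    """Take the log of each posterior, flooring to avoid log(0).

    Posteriors are bounded in [0, 1] and often pile up near the ends;
    the log spreads them out so they are easier to fit with Gaussians.
    """
    return [math.log(max(p, floor)) for p in posteriors]

# Two hypothetical MLPs scoring the same frame over three classes.
frame_a = softmax([4.0, 1.0, 0.5])
frame_b = softmax([3.0, 2.0, 0.0])

combined = average_posteriors([frame_a, frame_b])  # still 3-dimensional
features = log_posteriors(combined)
```

In this sketch the combined vector keeps the dimension of a single stream, which is the property the abstract highlights; concatenating the two streams would instead double it.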
