Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Multi-Layer Perceptrons (MLPs) can be used in automatic speech recognition in many ways. One application that has received particular attention over the last few years is the Tandem approach, described in [7] and in more recent publications. Here we discuss the characteristics of the MLP-based features used in the Tandem approach and conclude with a report on their application to conversational speech recognition. We show that MLP transformations yield variables with regular distributions, and that taking the logarithm of these variables makes the distributions easier to model with a Gaussian-mixture HMM. Two or more vectors of these features can easily be combined without increasing the feature dimension. We also report recognition results showing that MLP features can significantly improve recognition performance on the NIST 2001 Hub-5 evaluation set with models trained on the Switchboard Corpus, even for complex systems that incorporate MMIE training and other enhancements.
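
The two feature operations mentioned above, taking the logarithm of MLP outputs and combining several feature vectors without growing the dimension, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not state the combination rule, so plain averaging of posteriors is used as a stand-in, and the function names, the probability floor, and the toy scores are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw MLP output scores into a posterior distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_posteriors(streams):
    """Average several posterior vectors class by class.

    Assumption: simple averaging is only a stand-in for whatever
    combination rule the paper uses. The result has the same
    dimension as each input vector.
    """
    n = len(streams)
    return [sum(values) / n for values in zip(*streams)]


def log_features(posteriors, floor=1e-10):
    """Take the log of each posterior, flooring to avoid log(0).

    The floor value is a hypothetical choice, not from the paper.
    """
    return [math.log(max(p, floor)) for p in posteriors]


# Toy scores for one frame from two hypothetical MLPs over three classes.
post_a = softmax([2.0, 0.5, -1.0])
post_b = softmax([1.5, 1.0, -0.5])

combined = combine_posteriors([post_a, post_b])
features = log_features(combined)
```

Here `features` has three entries, the same as each individual posterior vector, whereas concatenating the two streams would have produced six.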

[1] H. Bourlard, et al. Links Between Markov Models and Multilayer Perceptrons, 1990, IEEE Trans. Pattern Anal. Mach. Intell.

[2] Hervé Bourlard, et al. Continuous speech recognition, 1995, IEEE Signal Process. Mag.

[3] Hynek Hermansky, et al. TRAPS - classifiers of temporal patterns, 1998, ICSLP.

[4] Mark J. F. Gales, et al. Semi-tied covariance matrices for hidden Markov models, 1999, IEEE Trans. Speech Audio Process.

[5] Daniel P. W. Ellis, et al. Tandem connectionist feature extraction for conventional HMM systems, 2000, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6] Hynek Hermansky, et al. Robust ASR front-end using spectral-based and discriminant features: experiments on the Aurora tasks, 2001, INTERSPEECH.

[7] Daniel P. W. Ellis, et al. Connectionist speech recognition of Broadcast News, 2002, Speech Commun.

[8] Daniel P. W. Ellis, et al. Error visualization for tandem acoustic modeling on the Aurora task, 2002, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[9] Hervé Bourlard, et al. New entropy based combination rules in HMM/ANN multi-stream ASR, 2003, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03).

[10] Nelson Morgan, et al. Learning long-term temporal features in LVCSR using neural networks, 2004, INTERSPEECH.

[11] Andreas Stolcke, et al. Trapping conversational speech: extending TRAP/tandem approaches to conversational telephone speech recognition, 2004, IEEE International Conference on Acoustics, Speech, and Signal Processing.