Tonotopic multi-layered perceptron: a neural network for learning long-term temporal features for speech recognition

We have been reducing word error rates (WER) on conversational telephone speech (CTS) tasks by capturing long-term (~500 ms) temporal information using multilayer perceptrons (MLPs). In this paper we experiment with an MLP architecture called the tonotopic MLP (TMLP), which has two hidden layers. The first of these is tonotopically organized: for each critical band, there is a disjoint set of hidden units that takes that band's long-term energy trajectory as its input. Thus, each of these subsets of hidden units learns to discriminate single-band energy trajectory patterns. The remaining layers are fully connected to their inputs. When used in combination with an intermediate-term (~100 ms) MLP system to augment standard PLP features, the TMLP reduces the WER on the 2001 NIST Hub-5 CTS evaluation set (Eval2001) by 8.87% relative. We show some practical advantages over our previous methods. We also report results from a series of experiments to determine the best ranges of hidden layer sizes and total parameters with respect to the number of training patterns for this task and architecture.
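
The layer structure described above can be illustrated with a minimal NumPy sketch of a forward pass. It is not the authors' implementation: the function names (`init_tmlp`, `tonotopic_layer`, `tmlp_forward`), the sigmoid and softmax nonlinearities, the weight initialization, and every layer size are assumptions chosen for illustration. Only the connectivity follows the abstract: a first hidden layer split into disjoint per-band groups, each seeing one band's energy trajectory, followed by fully connected layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_tmlp(n_bands, n_frames, band_hidden, hidden2, n_classes, seed=0):
    """Random parameters for a hypothetical TMLP; all sizes are assumptions."""
    rng = np.random.default_rng(seed)
    return {
        # One separate weight matrix per critical band (tonotopic layer).
        "W1": rng.normal(0.0, 0.1, (n_bands, n_frames, band_hidden)),
        "b1": np.zeros((n_bands, band_hidden)),
        # Second hidden layer: fully connected to all band-specific units.
        "W2": rng.normal(0.0, 0.1, (n_bands * band_hidden, hidden2)),
        "b2": np.zeros(hidden2),
        # Output layer: fully connected to the second hidden layer.
        "W3": rng.normal(0.0, 0.1, (hidden2, n_classes)),
        "b3": np.zeros(n_classes),
    }

def tonotopic_layer(params, x):
    """First hidden layer. x has shape (batch, n_bands, n_frames).

    Band k's hidden units are computed from band k's trajectory only,
    so the per-band groups of hidden units are disjoint.
    """
    pre = np.einsum("bkt,kth->bkh", x, params["W1"]) + params["b1"]
    return sigmoid(pre)  # shape (batch, n_bands, band_hidden)

def tmlp_forward(params, x):
    """Full forward pass returning per-class posteriors, shape (batch, n_classes)."""
    h1 = tonotopic_layer(params, x).reshape(x.shape[0], -1)
    h2 = sigmoid(h1 @ params["W2"] + params["b2"])
    return softmax(h2 @ params["W3"] + params["b3"])

# Illustrative sizes only: 15 bands, 51 frames per trajectory
# (about 500 ms if frames are 10 ms apart, which is an assumption).
params = init_tmlp(n_bands=15, n_frames=51, band_hidden=40, hidden2=750, n_classes=46)
x = np.random.default_rng(1).normal(size=(8, 15, 51))
posteriors = tmlp_forward(params, x)  # shape (8, 46)
```

Storing the first layer as one weight matrix per band, rather than a single dense matrix with masked entries, makes the tonotopic constraint structural: no parameter connects one band's trajectory to another band's hidden units.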
