Improved clustered hierarchical tandem system with bottom-up processing

The outputs of multi-layer perceptron (MLP) classifiers have been used successfully in tandem systems as features for HMM-based automatic speech recognition. In a previous paper, we proposed a data-driven clustered hierarchical MLP (CHMLP) tandem system that improved performance by dividing the complicated global phone classification problem into simpler hierarchical tasks, in which specialized MLPs are trained to classify small clusters of confusable phones. In this paper, we further propose bottom-up processing to enhance classification in the CHMLP and obtain even better performance. MLP rescoring for the tandem system is also investigated. The best configuration achieved a 19.1% relative error reduction over the MFCC baseline.
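As a rough illustration of the tandem idea mentioned above, the sketch below turns one frame of MLP phone posteriors into log-domain features and appends them to that frame's acoustic features. It is a minimal sketch under stated assumptions, not the CHMLP system itself: the function name `tandem_features`, the `floor` parameter, and the toy input values are hypothetical, and the decorrelation step that tandem systems typically apply to the log posteriors (for example PCA) is omitted for brevity.

```python
import math

def tandem_features(mfcc_frame, phone_posteriors, floor=1e-10):
    """Append log phone posteriors to one frame of acoustic features.

    Hypothetical helper for illustration; not taken from the paper.
    """
    total = sum(phone_posteriors)
    if total <= 0:
        raise ValueError("posteriors must have a positive sum")
    # Renormalize defensively, then floor each value so the log stays finite.
    log_post = [math.log(max(p / total, floor)) for p in phone_posteriors]
    # The HMM would be trained on the concatenated vector.
    return list(mfcc_frame) + log_post

# Toy values: 2 acoustic coefficients and posteriors over 3 phone classes.
frame = tandem_features([1.5, -0.3], [0.7, 0.2, 0.1])
```

In a hierarchical setup such as the one described here, `phone_posteriors` would come from the combined outputs of the cluster-specific MLPs rather than from a single global classifier; how those outputs are combined is not specified in the abstract.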
