Region Dependent Transform on MLP Features for Speech Recognition

In this work, Region Dependent Transform (RDT) is used as a feature extraction process to combine the traditional short-term acoustic features with the features derived from Multi-Layer Perceptrons (MLP) which is trained from the long-term features. When compared to the conventional feature augmentation approach, substantial improvement is obtained. Moreover, an improved RDT training procedure in which speaker dependent transforms are take into account is proposed for feature combinination in the Speaker Adaptive Training. By incorporating the higher dimensional features output from the layer prior to the bottleneck layer into our Speech-to-Text (STT) system using RDT, significant improvement is achieved as compared to using the conventional bottleneck features. In summary, by using the features derived from MLP with RDT, 8.2% to 11.4% relative reduction in Character Error Rate is achieved for our Mandarin STT systems.

[1]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[2]  Richard M. Schwartz,et al.  Recent progress on the discriminative region-dependent transform for speech feature extraction , 2006, INTERSPEECH.

[3]  Long Nguyen,et al.  Progress in the BBN 2007 Mandarin Speech to Text system , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  Jean-Luc Gauvain,et al.  Transcribing broadcast data using MLP features , 2008, INTERSPEECH.

[5]  Hynek Hermansky,et al.  TRAPS - classifiers of temporal patterns , 1998, ICSLP.

[6]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[7]  Daniel P. W. Ellis,et al.  Tandem connectionist feature extraction for conventional HMM systems , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[8]  Jan Cernocký,et al.  Probabilistic and Bottle-Neck Features for LVCSR of Meetings , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[9]  Tim Ng,et al.  Jointly optimized discriminative features for speech recognition , 2010, INTERSPEECH.

[10]  Richard M. Schwartz,et al.  Discriminatively Trained Region Dependent Feature Transforms for Speech Recognition , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[11]  Mark J. F. Gales,et al.  Training and adapting MLP features for Arabic speech recognition , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12]  Jorge Nocedal,et al.  On the limited memory BFGS method for large scale optimization , 1989, Math. Program..

[13]  Andreas Stolcke,et al.  Improved discriminative training using phone lattices , 2005, INTERSPEECH.