Tandem connectionist feature extraction for conventional HMM systems

Hidden Markov model speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual subword units. By contrast, hybrid connectionist-HMM systems use discriminatively trained neural networks to estimate the posterior probability distribution over subword units given the acoustic observations. In this work we show a large improvement in word recognition performance by combining neural-net discriminative feature processing with Gaussian-mixture distribution modeling. By training the network to generate subword posterior probabilities, then using transformations of these estimates as the base features for a conventionally trained Gaussian-mixture-based system, we achieve relative error rate reductions of 35% or more on the multicondition Aurora noisy continuous digits task.
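
The feature path described above can be sketched as follows. The abstract says only that "transformations" of the posterior estimates are used, so the two steps here (log compression followed by PCA decorrelation) are assumptions chosen because Gaussian mixture models with diagonal covariances generally prefer roughly Gaussian, decorrelated inputs. The function name `tandem_features`, the floor `eps`, and the random stand-in posteriors are hypothetical; this is an illustration, not the authors' implementation.

```python
import numpy as np


def tandem_features(posteriors, n_components=None, eps=1e-10):
    """Turn per-frame subword posteriors into decorrelated features.

    posteriors: array of shape (n_frames, n_classes), rows sum to 1.
    Assumed transformations (not specified in the abstract):
    log compression, then PCA estimated on the same data.
    """
    # Log compression spreads out the heavily skewed posterior values.
    log_post = np.log(np.clip(posteriors, eps, 1.0))

    # Center, then rotate onto the eigenvectors of the covariance matrix.
    centered = log_post - log_post.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # eigh returns ascending eigenvalues; keep the largest first.
    order = np.argsort(eigvals)[::-1]
    basis = eigvecs[:, order]
    if n_components is not None:
        basis = basis[:, :n_components]
    return centered @ basis


# Stand-in for network outputs: softmax over random scores
# (500 frames, 24 subword classes, both arbitrary).
rng = np.random.default_rng(0)
scores = rng.normal(size=(500, 24))
posteriors = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

features = tandem_features(posteriors)
feature_cov = np.cov(features, rowvar=False)
```

The resulting `features` array would then take the place of the usual acoustic feature vectors when training the Gaussian-mixture HMM system. In practice the PCA basis would be estimated once on training data and reused at test time, rather than recomputed per call as in this sketch.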
