Sequential Classification Criteria for NNs in Automatic Speech Recognition

Neural networks (NNs) are discriminative classifiers that have been successfully integrated with hidden Markov models (HMMs), either in hybrid NN/HMM systems or in tandem connectionist systems. Typically, the NNs are trained with the frame-based cross-entropy criterion to classify phonemes or phoneme states. For word recognition, however, the word error rate is more closely related to sequence classification criteria such as maximum mutual information (MMI) and minimum phone error (MPE). In this paper, lattice-based sequence classification criteria are used to train the NNs in both the hybrid NN/HMM system and the tandem system. A product-of-expert-based factorization and smoothing scheme is proposed for the hybrid system to scale lattice-based NN training up to 6000 triphone states. Experimental results on the WSJCAM0 corpus show that NNs trained with the sequential classification criterion yield a 24.2% relative improvement over the cross-entropy-trained NNs for the hybrid system.
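
To make the contrast between the frame-based cross-entropy criterion and a sequence-level criterion such as maximum mutual information (MMI) concrete, the following is a minimal sketch and not the paper's implementation. It rests on several simplifying assumptions: an explicit list of competing state sequences stands in for a lattice, state priors are taken as uniform so that log posteriors serve directly as acoustic scores, and all function names, the scale `kappa`, and the toy numbers are hypothetical.

```python
import math


def frame_cross_entropy(log_posteriors, targets):
    """Average negative log posterior of the target state, frame by frame."""
    total = sum(frame[t] for frame, t in zip(log_posteriors, targets))
    return -total / len(targets)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(log_posteriors, states, log_lm, kappa=1.0):
    """Scaled acoustic log score of one state sequence plus its LM log prob."""
    acoustic = sum(frame[s] for frame, s in zip(log_posteriors, states))
    return kappa * acoustic + log_lm


def mmi_objective(log_posteriors, reference, competitors, kappa=1.0):
    """Log posterior of the reference among all hypotheses (higher is better).

    Each hypothesis is a (state_sequence, log_lm_probability) pair. The
    denominator includes the reference itself plus every competitor, which
    is the role a lattice plays in a real system (assumption: short list).
    """
    numerator = sequence_score(log_posteriors, *reference, kappa=kappa)
    denominator = log_sum_exp(
        [sequence_score(log_posteriors, *h, kappa=kappa)
         for h in [reference] + competitors]
    )
    return numerator - denominator


# Toy example: 3 frames, 2 states, NN outputs given as log posteriors.
posts = [[math.log(p) for p in frame]
         for frame in ([0.7, 0.3], [0.6, 0.4], [0.2, 0.8])]
reference = ([0, 0, 1], math.log(0.5))    # correct state sequence
competitor = ([0, 1, 1], math.log(0.5))   # differs in the second frame

ce = frame_cross_entropy(posts, reference[0])
mmi = mmi_objective(posts, reference, [competitor])
print(f"frame cross-entropy: {ce:.4f}")
print(f"MMI objective:       {mmi:.4f}")
```

In this sketch the cross-entropy value depends only on the per-frame target posteriors, whereas the MMI value depends on how the whole reference sequence scores relative to its competitors, which is why the latter is more directly tied to sequence-level errors.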
