Predicting speech recognition confidence using deep learning with word identity and score features

Confidence classifiers for automatic speech recognition (ASR) provide a quantitative representation of the reliability of ASR decoding. In this paper, we improve the performance of the ASR confidence measure for an utterance using two distinct approaches: (1) defining and incorporating additional predictors in the confidence classifier, including those based on the word identity and on the aggregated words, and (2) training the confidence classifier on deep learning architectures, including the deep neural network (DNN) and the kernel deep convex network (K-DCN). Our experiments show that adding the new predictors to our multi-layer perceptron (MLP)-based baseline classifier provides a 38.6% relative reduction in the correct-reject rate, which is our measure of classifier performance. Further, replacing the MLP with the DNN and the K-DCN provides additional relative performance gains of 14.5% and 47.5%, respectively.
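The results above are stated as relative changes. As a reading aid, the arithmetic behind a relative reduction can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function name `relative_reduction` and the two rates are hypothetical and chosen only to show the calculation, and the abstract does not specify how the correct-reject rate itself is computed.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the reduction from `baseline` to `improved` as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical rates for illustration only; these are NOT results from the paper.
    baseline_rate = 0.20
    improved_rate = 0.15
    print(f"relative reduction: {relative_reduction(baseline_rate, improved_rate):.1%}")
```

A relative figure depends on the baseline it is measured against, so percentages reported against different baselines are not directly additive.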
