Error detection and accuracy estimation in automatic speech recognition using deep bidirectional recurrent neural networks

Highlights: Deep bidirectional RNNs (DBRNNs) are applied to ASR error detection and accuracy estimation. DBRNNs take a longer bidirectional context of input feature vectors into account. DBRNNs model highly nonlinear relationships between input feature vectors and output labels. DBRNNs greatly outperform CRFs and other neural network structures.

Recurrent neural networks (RNNs) have recently been applied as classifiers for sequential labeling problems. In this paper, deep bidirectional RNNs (DBRNNs) are applied to error detection in automatic speech recognition (ASR), which is a sequential labeling problem. We investigate three types of ASR error detection tasks: confidence estimation, out-of-vocabulary word detection, and error type classification. We also estimate ASR accuracy, i.e., percent correct and word accuracy, from the error type classification results. Experimental results using English and Japanese lecture speech corpora show that the DBRNNs greatly outperform conditional random fields (CRFs) and the other NN structures, i.e., deep feedforward NNs (DNNs) and deep unidirectional RNNs (DURNNs). These improvements arise because the DBRNNs can take a longer bidirectional context of input feature vectors into account and can model highly nonlinear relationships between the input feature vectors and the output labels. In detailed analyses, the DBRNNs show better generalization ability than the CRFs, which is attributable to the ability of the DBRNNs to represent (embed) words in a low-dimensional continuous vector space. In addition, the superiority of the DBRNNs over the DNNs and DURNNs indicates that the average length of input feature context required for ASR error detection is only a few time steps, although the required length changes (lengthens) depending on the situation.
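As a rough illustration of how percent correct and word accuracy can be derived from error type classification results, the sketch below assumes that each aligned slot of a recognition result has already been labeled as correct (C), substitution (S), deletion (D), or insertion (I). The label names and the function `estimate_accuracy` are hypothetical and are not taken from the paper; the formulas are the standard definitions, with the number of reference words N = C + S + D, percent correct = C / N, and word accuracy = (C - I) / N.

```python
from collections import Counter


def estimate_accuracy(labels):
    """Estimate percent correct and word accuracy from error type labels.

    `labels` is an iterable of per-slot labels, each one of
    "C" (correct), "S" (substitution), "D" (deletion), "I" (insertion).
    The label set is an assumption made for this sketch.
    """
    counts = Counter(labels)
    unknown = set(counts) - set("CSDI")
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")

    c, s, d, i = (counts[k] for k in "CSDI")
    n_ref = c + s + d  # insertions have no counterpart in the reference
    if n_ref == 0:
        raise ValueError("no reference words (C + S + D == 0)")

    percent_correct = 100.0 * c / n_ref
    word_accuracy = 100.0 * (c - i) / n_ref  # insertions are penalized here
    return percent_correct, word_accuracy


# Toy example: 8 correct words, 1 substitution, 1 deletion, 1 insertion.
labels = ["C"] * 8 + ["S", "D", "I"]
pc, wa = estimate_accuracy(labels)
print(f"percent correct = {pc:.1f}%, word accuracy = {wa:.1f}%")
```

Because insertions are subtracted only in word accuracy, the two measures differ whenever the classifier predicts at least one insertion, and word accuracy can become negative for heavily inserted output.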
