Analyzing Uncertainties in Speech Recognition Using Dropout

The performance of Automatic Speech Recognition (ASR) systems is usually measured by Word Error Rate (WER), which requires manually transcribed data that is time-consuming and expensive to produce. In this paper, we use state-of-the-art ASR systems based on Deep Neural Networks (DNN) and propose a novel framework that applies "Dropout" at test time to model uncertainty in the prediction hypotheses. We systematically exploit this uncertainty to estimate WER without the need for explicit transcriptions. In addition, we show that the predictive uncertainty can be used to accurately localize the errors made by the ASR system. We study the performance of our approach on the Switchboard database, where it predicts WER accurately within a range of 2.6% and 5.0% for HMM-DNN and Connectionist Temporal Classification (CTC) ASR systems, respectively.
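The idea of reading uncertainty off test-time dropout can be illustrated with a minimal sketch. It assumes that the same utterance has already been decoded once without dropout and several times with dropout active, and it measures how far the dropout hypotheses drift from the baseline hypothesis using word-level edit distance. The function names `edit_distance` and `dropout_disagreement`, the normalization by baseline length, and the example hypotheses are illustrative assumptions, not the feature set or estimator used in the paper.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def dropout_disagreement(baseline: str, dropout_hyps: List[str]) -> float:
    """Mean edit distance from the baseline hypothesis to each dropout
    hypothesis, normalized by the baseline length in words.

    Assumption: `dropout_hyps` come from repeated decoding passes of the
    same utterance with dropout left on at test time (hypothetical input).
    """
    base = baseline.split()
    if not base or not dropout_hyps:
        return 0.0
    total = sum(edit_distance(base, h.split()) for h in dropout_hyps)
    return total / (len(dropout_hyps) * len(base))


# Hypothetical hypotheses for one utterance; no transcription is involved.
baseline_hyp = "i want to recognize speech"
dropout_hyps = [
    "i want to recognize speech",
    "i want to wreck a nice beach",
    "i want to recognize speech",
]
score = dropout_disagreement(baseline_hyp, dropout_hyps)
print(f"dropout disagreement: {score:.3f}")
```

A score of zero means every dropout pass reproduced the baseline hypothesis; larger values indicate less stable output. Mapping such a score to an actual WER estimate would still require a model fitted on held-out transcribed data, which this sketch does not attempt.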
