An Evaluation of Word-Level Confidence Estimation for End-to-End Automatic Speech Recognition

Quantifying the confidence (or, conversely, the uncertainty) of a prediction is a highly desirable trait of an automatic system, as it improves robustness and usefulness in downstream tasks. In this paper we investigate confidence estimation for end-to-end automatic speech recognition (ASR). Previous work has addressed confidence measures for lattice-based ASR, while current machine learning research mostly focuses on confidence measures for unstructured deep learning. However, as ASR systems are increasingly built upon deep end-to-end methods, there is little work on developing confidence measures in this context. We fill this gap by providing an extensive benchmark of popular confidence methods on four well-known speech datasets. We overcome two challenges in adapting existing methods: working on structured data (sequences) and obtaining confidences at a coarser granularity than the predictions (words instead of tokens). Our results suggest that a strong baseline can be obtained by scaling the logits by a learnt temperature, estimating the confidence as the negative entropy of the resulting predictive distribution, and finally sum-pooling the token scores to the word level.
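To make the baseline concrete, below is a minimal sketch of the three steps named above (temperature scaling, negative-entropy scoring, sum pooling). The function name, the temperature value, and the `word_ids` token-to-word mapping are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def word_confidences(logits, word_ids, temperature=1.5):
    """Sketch of the baseline: (1) scale token logits by a learnt temperature,
    (2) score each token by the negative entropy of its predictive distribution,
    (3) sum-pool token scores into word-level confidences.

    logits      : (num_tokens, vocab_size) decoder outputs for one hypothesis
    word_ids    : list mapping each token to the index of the word it belongs to
    temperature : assumed to be learnt on held-out data; 1.5 is a placeholder
    """
    probs = torch.softmax(logits / temperature, dim=-1)  # temperature-scaled distribution
    # negative entropy per token: sum_v p(v) log p(v)  (higher = more confident)
    neg_entropy = (probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)

    num_words = max(word_ids) + 1
    conf = torch.zeros(num_words)
    conf.index_add_(0, torch.tensor(word_ids), neg_entropy)  # sum-pool tokens -> words
    return conf
```

Word-level scores obtained this way can then be thresholded or ranked, e.g. to flag likely recognition errors or to select utterances for semi-supervised training.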
