Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks

Recently, there has been a growth in providers of speech transcription services, enabling others to leverage technology they would not normally be able to use. As a result, speech-enabled solutions have become commonplace. Their success critically relies on the quality, accuracy, and reliability of the underlying speech transcription systems. Those black box systems, however, offer limited means for quality control, as only word sequences are typically available. This paper examines this limited resource scenario for confidence estimation, a measure commonly used to assess transcription reliability. In particular, it explores what other sources of word and sub-word level information available in the transcription process could be used to improve confidence scores. To encode all such information, this paper extends lattice recurrent neural networks to handle sub-words. Experimental results using the IARPA OpenKWS 2016 evaluation system show that the use of additional information yields significant gains in confidence estimation accuracy. The implementation of this model is available online.
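To illustrate the idea of running a recurrent model over a lattice rather than a linear word sequence, below is a minimal sketch of a uni-directional lattice RNN confidence estimator in PyTorch. It assumes a topologically sorted lattice whose arcs each carry a fixed-size feature vector (for example, a word embedding concatenated with sub-word and decoder features) and merges incoming node states with a simple running mean; the model described in the paper instead merges incoming arcs with an attention mechanism and also uses a backward pass over the lattice. All class and variable names here are illustrative and do not correspond to the released implementation.

```python
# Minimal sketch of a uni-directional lattice RNN confidence estimator.
# Assumptions (not from the paper's release): the lattice arcs are given in
# topological order, each arc carries a fixed-size feature vector (e.g. a word
# embedding concatenated with sub-word and decoder features), and incoming
# node states are merged by a running mean rather than attention.
import torch
import torch.nn as nn


class LatticeRNNConfidence(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, hidden_dim)  # recurrence along each arc
        self.out = nn.Linear(hidden_dim, 1)           # arc-level confidence score

    def forward(self, arcs, num_nodes: int):
        """arcs: list of (start_node, end_node, features), features of shape [feat_dim]."""
        hidden_dim = self.cell.hidden_size
        node_state = [torch.zeros(hidden_dim) for _ in range(num_nodes)]
        node_count = [0] * num_nodes
        confidences = []
        for start, end, feats in arcs:
            h_in = node_state[start].unsqueeze(0)        # merged history at the start node
            h_out = self.cell(feats.unsqueeze(0), h_in)  # propagate state along the arc
            confidences.append(torch.sigmoid(self.out(h_out)).squeeze())
            # Merge this arc's state into the end node with a running mean.
            node_count[end] += 1
            node_state[end] = node_state[end] + (h_out.squeeze(0) - node_state[end]) / node_count[end]
        return torch.stack(confidences)                  # one confidence per lattice arc


# Toy usage: a three-node lattice where two competing paths reach the final node.
model = LatticeRNNConfidence(feat_dim=8, hidden_dim=16)
arcs = [(0, 1, torch.randn(8)), (0, 2, torch.randn(8)), (1, 2, torch.randn(8))]
print(model(arcs, num_nodes=3))
```

In the full bi-directional, sub-word-aware model sketched in the abstract, each arc's prediction would combine forward and backward lattice states, and the sub-word features would come from a separate encoder over the grapheme sequence of each word; the sketch above only shows the forward, mean-pooled case.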
