Confidence Estimation and Deletion Prediction Using Bidirectional Recurrent Neural Networks

The standard approach to assessing the reliability of automatic speech transcriptions is through confidence scores. If accurate, these scores provide a flexible mechanism for flagging transcription errors for upstream and downstream applications. One challenging type of error that recognisers make is the deletion. Deletions are not accounted for by standard confidence estimation schemes and are hard to rectify in upstream and downstream processing. High deletion rates are prominent in the limited-resource and highly mismatched training/testing conditions studied under the IARPA Babel and Material programs. This paper examines the use of bidirectional recurrent neural networks to yield confidence estimates for predicted as well as deleted words. Several simple schemes are examined for combining the two. To assess the usefulness of this approach, the combined confidence score is examined for untranscribed data selection that favours transcriptions with fewer deletion errors. Experiments are conducted using IARPA Babel/Material program languages.
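
As a rough illustration of how word confidences and deletion predictions might be combined for data selection, the sketch below scores each utterance by its mean word confidence multiplied by the probability that no word was deleted. This particular combination rule, the function names `utterance_score` and `select_utterances`, and the dictionary layout are assumptions made for illustration; the abstract does not specify which schemes the paper examines.

```python
from math import prod


def utterance_score(word_conf, del_prob):
    """Combine word confidences and deletion probabilities (hypothetical scheme).

    word_conf: confidence in [0, 1] for each hypothesised word.
    del_prob:  probability in [0, 1] that a word was deleted at each slot
               between or around the hypothesised words.
    """
    if not word_conf:
        return 0.0
    mean_conf = sum(word_conf) / len(word_conf)
    # Probability that no slot hides a deleted word, treating slots as independent.
    no_deletion = prod(1.0 - p for p in del_prob)
    return mean_conf * no_deletion


def select_utterances(utts, top_k):
    """Return the ids of the top_k utterances ranked by combined score."""
    ranked = sorted(
        utts,
        key=lambda u: utterance_score(u["conf"], u["del"]),
        reverse=True,
    )
    return [u["id"] for u in ranked[:top_k]]


# Toy input: utterance "a" has higher word confidence but a likely deletion.
utts = [
    {"id": "a", "conf": [0.9, 0.9], "del": [0.5, 0.0, 0.0]},
    {"id": "b", "conf": [0.8, 0.8], "del": [0.0, 0.0, 0.0]},
]
selected = select_utterances(utts, top_k=1)
```

In this toy input, utterance "a" would rank first on word confidence alone, but its likely deletion pushes it below "b" once the two scores are combined, which is the behaviour the selection criterion is meant to favour.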
