Confidence Estimation and Deletion Prediction Using Bidirectional Recurrent Neural Networks

The standard approach to assess reliability of automatic speech transcriptions is through the use of confidence scores. If accurate, these scores provide a flexible mechanism to flag transcription errors for upstream and downstream applications. One challenging type of errors that recognisers make are deletions. These errors are not accounted for by the standard confidence estimation schemes and are hard to rectify in the upstream and downstream processing. High deletion rates are prominent in limited resource and highly mismatched training/testing conditions studied under IARPA Babel and Material programs. This paper looks at the use of bidirectional recurrent neural networks to yield confidence estimates in predicted as well as deleted words. Several simple schemes are examined for combination. To assess usefulness of this approach, the combined confidence score is examined for untranscribed data selection that favours transcriptions with lower deletion errors. Experiments are conducted using IARPA Babel/Material program languages.

[1]  Steve J. Young,et al.  Still talking to machines (cognitively speaking) , 2010, INTERSPEECH.

[2]  Philip C. Woodland,et al.  Detecting deletions in ASR output , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Yi Pan,et al.  Conversational AI: The Science Behind the Alexa Prize , 2018, ArXiv.

[4]  Andrew McCallum,et al.  An Introduction to Conditional Random Fields , 2010, Found. Trends Mach. Learn..

[5]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[6]  Yu Wang,et al.  Impact of ASR Performance on Free Speaking Language Assessment , 2018, INTERSPEECH.

[7]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[8]  Richard M. Schwartz,et al.  Enhancing low resource keyword spotting with automatically retrieved web documents , 2015, INTERSPEECH.

[9]  Mark J. F. Gales,et al.  Improving speech recognition and keyword search for low resource languages using web data , 2015, INTERSPEECH.

[10]  Geoffrey Zweig,et al.  The microsoft 2016 conversational speech recognition system , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Ralf Schlüter,et al.  Using word probabilities as confidence measures , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[12]  Adrià Giménez,et al.  Speaker-Adapted Confidence Measures for ASR Using Deep Bidirectional Recurrent Neural Networks , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[13]  Hermann Ney,et al.  Joint-sequence models for grapheme-to-phoneme conversion , 2008, Speech Commun..

[14]  George Saon,et al.  The IBM 2016 English Conversational Telephone Speech Recognition System , 2016, INTERSPEECH.

[15]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[16]  Sanjeev Khudanpur,et al.  A pitch extraction algorithm tuned for automatic speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Mark J. F. Gales,et al.  Automatic Speech Recognition System Development in the "Wild" , 2018, INTERSPEECH.

[18]  Gunnar Evermann,et al.  Large vocabulary decoding and confidence estimation using word posterior probabilities , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[19]  Ricky Ho Yin Chan,et al.  Improving broadcast news transcription by lightly supervised discriminative training , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[20]  Richard M. Schwartz,et al.  Unsupervised Training on Large Amounts of Broadcast News Data , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[21]  Mitch Weintraub,et al.  Neural-network based measures of confidence for word recognition , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[22]  Herbert Gish,et al.  Improved estimation, evaluation and applications of confidence measures for speech recognition , 1997, EUROSPEECH.

[23]  Mitch Weintraub,et al.  Acoustic Modeling for Google Home , 2017, INTERSPEECH.

[24]  Richard M. Schwartz,et al.  The 2016 BBN Georgian telephone speech keyword spotting system , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25]  Philip C. Woodland,et al.  Combining Information Sources for Confidence Estimation with CRF Models , 2011, INTERSPEECH.

[26]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[27]  Yiming Wang,et al.  Low Latency Acoustic Modeling Using Temporal Convolution and LSTMs , 2018, IEEE Signal Processing Letters.

[28]  Spyridon Matsoukas,et al.  Unsupervised Training on a Large Amount of Arabic Broadcast News Data , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[29]  Thomas Schaaf,et al.  Confidence measures for spontaneous speech recognition , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[30]  Thomas Schaaf,et al.  Estimating confidence using word lattices , 1997, EUROSPEECH.

[31]  Larry Gillick,et al.  A probabilistic approach to confidence estimation and evaluation , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[32]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[33]  Kaisheng Yao,et al.  Estimating confidence scores on ASR results using recurrent neural networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34]  Mark Goadrich,et al.  The relationship between Precision-Recall and ROC curves , 2006, ICML.

[35]  Alexander Gruenstein,et al.  Unsupervised Testing Strategies for ASR , 2011, INTERSPEECH.