Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers

Modeling the errors of a speech recognizer makes it possible to simulate errorful recognized speech data from plain text, which has proven useful for tasks such as discriminative language modeling and improving the robustness of NLP systems when limited or even no audio data is available at training time. Previous work has typically focused on replicating the behavior of GMM-HMM-based systems, but more modern posterior-based neural network acoustic models behave differently and require adjustments to the error prediction model. In this work, we extend a prior phonetic-confusion-based model for predicting speech recognition errors in two ways: first, we introduce a sampling-based paradigm that better simulates the behavior of a posterior-based acoustic model; second, we investigate replacing the confusion matrix with a sequence-to-sequence model in order to introduce context dependency into the prediction. We evaluate the error predictors in two ways: first by predicting the errors made by a Switchboard-trained ASR system on unseen data (Fisher), and then by using that same predictor to estimate the behavior of an unrelated cloud-based ASR system on a novel task. Sampling greatly improves predictive accuracy within a 100-guess paradigm, while the sequence model performs similarly to the confusion matrix.
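
To make the sampling paradigm concrete, here is a minimal sketch of phone-level error simulation by sampling from a confusion distribution rather than always taking the most likely outcome. The confusion table, phone labels, probabilities, and the `<del>` deletion token below are illustrative assumptions for exposition, not the paper's actual data structures or implementation:

```python
import random

# Illustrative confusion distribution (hypothetical numbers); in practice
# such a table would be estimated by phonetically aligning reference
# transcripts with ASR output. Each reference phone maps to possible
# hypothesis outcomes; '<del>' marks a deletion. Insertions are omitted
# to keep the sketch short.
CONFUSIONS = {
    "ae": [("ae", 0.85), ("eh", 0.10), ("<del>", 0.05)],
    "t":  [("t", 0.90), ("d", 0.06), ("<del>", 0.04)],
    # ... one entry per phone in the inventory
}

def sample_errorful_phones(reference_phones):
    """Simulate one recognizer pass by sampling an outcome for every
    reference phone instead of deterministically substituting the
    single most confusable phone."""
    hypothesis = []
    for phone in reference_phones:
        # Unseen phones fall back to passing through unchanged.
        outcomes, weights = zip(*CONFUSIONS.get(phone, [(phone, 1.0)]))
        sampled = random.choices(outcomes, weights=weights, k=1)[0]
        if sampled != "<del>":
            hypothesis.append(sampled)
    return hypothesis

# Repeated sampling yields a spread of plausible error patterns, e.g.
# the multiple hypotheses scored in a 100-guess evaluation:
guesses = [sample_errorful_phones(["k", "ae", "t"]) for _ in range(100)]
```

Because each pass draws independently from the confusion distribution, repeated sampling naturally produces the varied error patterns a posterior-based recognizer exhibits, which is the behavior the 100-guess evaluation is designed to measure.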
