Learning from past mistakes: improving automatic speech recognition output via noisy-clean phrase context modeling

Automatic speech recognition (ASR) systems lack joint optimization during decoding over the acoustic, lexical, and language models; for instance, the ASR will often prune words based on short-term acoustic context before rescoring with long-term context. In this work we model the automated speech transcription process as a noisy transformation channel and propose an error correction system that can learn from the aggregate errors of all the independent modules constituting the ASR. The proposed system can exploit long-term context using a neural network language model, and can both choose more accurately among existing ASR output possibilities and re-introduce previously pruned or unseen (out-of-vocabulary) phrases. The system provides significant corrections under poorly performing ASR conditions without degrading accurate transcriptions. The proposed system can thus be optimized independently and used to post-process the output of even a highly optimized ASR. We show that the system consistently provides improvements over the baseline ASR. We also show that it performs better on out-of-domain and mismatched test data and under high-error ASR conditions. Finally, we present an extensive analysis of the types of errors corrected by our system.
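To make the noisy-channel view concrete, a minimal sketch of the standard formulation follows; the notation here is assumed for illustration and is not necessarily the paper's exact parameterization. Given the raw ASR output $\tilde{W}$, a corrected transcript $\hat{W}$ is chosen as

$\hat{W} = \arg\max_{W} P(W \mid \tilde{W}) = \arg\max_{W} P(\tilde{W} \mid W)\, P(W)$,

where $P(\tilde{W} \mid W)$ plays the role of the channel (error) model, which in this setting would be learned from aligned noisy (ASR output) and clean (reference) phrase pairs, and $P(W)$ is a language model, here a neural network language model that supplies the long-term context unavailable during first-pass decoding.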
