Learning from past mistakes: improving automatic speech recognition output via noisy-clean phrase context modeling

Automatic speech recognition (ASR) systems lack joint optimization during decoding over the acoustic, lexical, and language models; for instance, the ASR will often prune words based on short-term acoustic context before rescoring with long-term context. In this work we model the automated speech transcription process as a noisy transformation channel and propose an error correction system that can learn from the aggregate errors of all the independent modules constituting the ASR. The proposed system can exploit long-term context using a neural network language model, and can both choose more accurately among existing ASR output possibilities and re-introduce previously pruned or unseen (out-of-vocabulary) phrases. The system provides significant corrections under poorly performing ASR conditions without degrading accurate transcriptions. The proposed system can thus be optimized independently and used to post-process the output of even a highly optimized ASR. We show that the system consistently provides improvements over the baseline ASR. We also show that it performs better on out-of-domain and mismatched test data and under high-error ASR conditions. Finally, we present an extensive analysis of the types of errors corrected by our system.
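To make the noisy-channel view concrete, a minimal sketch of the standard formulation follows; the notation here is assumed for illustration and is not necessarily the paper's exact parameterization. Given the raw ASR output $\tilde{W}$, a corrected transcript $\hat{W}$ is chosen as

$\hat{W} = \arg\max_{W} P(W \mid \tilde{W}) = \arg\max_{W} P(\tilde{W} \mid W)\, P(W)$,

where $P(\tilde{W} \mid W)$ plays the role of the channel (error) model, which in this setting would be learned from aligned noisy (ASR output) and clean (reference) phrase pairs, and $P(W)$ is a language model, here a neural network language model that supplies the long-term context unavailable during first-pass decoding.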
