Automatic Correction of ASR Outputs by Using Machine Translation

One of the main challenges when working with a domain-independent automatic speech recognizer (ASR) is to correctly transcribe rare or out-of-vocabulary words that are not included in the language model or whose probabilities are underestimated. Although the usual solution is to adapt the language models and pronunciation vocabularies, in some settings, such as when using free online recognizers, this is not possible, and post-recognition corrections must be applied instead. In this paper, we propose an automatic correction procedure that applies a phrase-based machine translation system, trained on both word and phonetic-encoding representations, to the n-best lists generated by the ASR. Our experiments on two different datasets, human-computer interfaces for robots and human-to-human dialogs about tourism information, show that the proposed methodology provides a quick and robust mechanism to improve ASR performance by reducing both the word error rate (WER) and the character error rate (CER).
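As a rough illustration of this pipeline (a minimal sketch, not the authors' implementation), the Python snippet below corrects each hypothesis in an ASR n-best list using a toy phrase table of error-to-correction mappings and measures the resulting word error rate. The phrase table, n-best scores, and function names are hypothetical placeholders for what a phrase-based SMT system trained on word and phonetic representations would produce.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: token-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(r)][len(h)] / max(len(r), 1)

# Toy phrase table (hypothetical): in the paper's setting this mapping
# would be learned by a phrase-based SMT system from parallel
# ASR-output / reference-transcription data.
PHRASE_TABLE = {
    "go to the kitten": "go to the kitchen",
    "grab the back": "grab the bag",
}

def correct(hypothesis: str) -> str:
    """Apply phrase-table substitutions to a single ASR hypothesis."""
    for src, tgt in PHRASE_TABLE.items():
        hypothesis = hypothesis.replace(src, tgt)
    return hypothesis

def correct_nbest(nbest):
    """Correct every entry of a (score, hypothesis) n-best list."""
    return [(score, correct(hyp)) for score, hyp in nbest]

if __name__ == "__main__":
    reference = "please go to the kitchen"
    nbest = [(-1.2, "please go to the kitten"),
             (-2.7, "please go to the kitchen")]
    for score, hyp in correct_nbest(nbest):
        print(f"{score:+.1f}  {hyp}  WER={wer(reference, hyp):.2f}")
```

A full system would additionally rescore the corrected n-best list and could fall back on phonetic-encoding matches (e.g., double metaphone codes) when surface-form phrases do not match; both steps are omitted here for brevity.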
