Better Evaluation of ASR in Speech Translation Context Using Word Embeddings

This paper investigates the evaluation of ASR in a spoken language translation (SLT) context. More precisely, we propose a simple extension of the WER metric that uses word embeddings to penalize substitution errors differently according to their semantic context. For instance, the proposed metric should catch near matches (mainly morphological variants) and penalize this kind of error less, since it has a more limited impact on translation performance. Our experiments show that the new metric correlates better with SLT performance than WER does. Oracle experiments are also conducted and show the ability of our metric to find better hypotheses (to be translated) in ASR N-best lists. Finally, a preliminary experiment in which ASR tuning is based on our new metric shows encouraging results. For reproducible experiments, the code for computing our modified WER and the corpora used are made available to the research community.
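To make the idea concrete, here is a minimal sketch of an embedding-weighted WER. It is not the paper's exact formulation: the function name `embedding_wer`, the `sub_floor` parameter, and the choice of mapping cosine similarity to a substitution cost in [0, 1] are all illustrative assumptions. It keeps the standard Levenshtein alignment but scales the substitution cost by embedding distance, so morphological variants and other near matches are penalized less than unrelated words.

```python
import numpy as np

def embedding_wer(ref, hyp, embeddings, sub_floor=0.0):
    """WER variant where the substitution cost depends on the semantic
    distance between reference and hypothesis words (sketch, not the
    paper's exact metric). `embeddings` maps word -> non-zero vector."""

    def sub_cost(r, h):
        if r == h:
            return 0.0
        if r in embeddings and h in embeddings:
            vr, vh = embeddings[r], embeddings[h]
            cos = np.dot(vr, vh) / (np.linalg.norm(vr) * np.linalg.norm(vh))
            # Map cosine similarity to a cost in [0, 1]; negative
            # similarities are treated as fully dissimilar.
            return max(sub_floor, 1.0 - max(cos, 0.0))
        return 1.0  # OOV word: fall back to the standard WER cost

    n, m = len(ref), len(hyp)
    d = np.zeros((n + 1, m + 1))
    d[:, 0] = np.arange(n + 1)  # all deletions
    d[0, :] = np.arange(m + 1)  # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i, j] = min(
                d[i - 1, j] + 1.0,                                # deletion
                d[i, j - 1] + 1.0,                                # insertion
                d[i - 1, j - 1] + sub_cost(ref[i - 1], hyp[j - 1])  # substitution
            )
    return d[n, m] / max(n, 1)
```

With toy vectors, substituting "cats" for "cat" costs almost nothing, whereas plain WER would charge a full error for it:

```python
emb = {"cat": np.array([1.0, 0.1]),
       "cats": np.array([0.9, 0.2]),
       "dog": np.array([-0.8, 0.6])}
ref = "the cat sleeps".split()
hyp = "the cats sleeps".split()
print(embedding_wer(ref, hyp, emb))  # ~0.002 instead of WER = 1/3
```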
