Phonetically-oriented word error alignment for speech recognition error analysis in speech translation

We propose a variation of the commonly used Word Error Rate (WER) metric for speech recognition evaluation that incorporates phoneme alignment in the absence of time boundary information. After computing the Levenshtein alignment on words in the reference and hypothesis transcripts, spans of adjacent errors are converted into phonemes with word and syllable boundaries, and a phonetic Levenshtein alignment is performed. The phoneme alignment information is used to correct the word alignment labels in each error region. We demonstrate that our Phonetically-Oriented Word Error Rate (POWER) yields scores similar to WER, with the added advantages of better word alignments and the ability to capture one-to-many alignments corresponding to homophonic errors in speech recognition hypotheses. These improved alignments allow us to better trace the impact of Levenshtein error types in speech recognition on downstream tasks such as speech translation.
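The two-pass idea described above can be sketched as follows. This is a hypothetical, minimal illustration, not the published implementation: POWER converts error regions into real phoneme sequences with word and syllable boundaries, whereas here characters simply stand in for phonemes, and the function name `levenshtein_align` is our own.

```python
def levenshtein_align(ref, hyp):
    """Align two token sequences; return ops 'C' (correct), 'S', 'I', 'D'."""
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i          # deleting all reference tokens
    for j in range(m + 1):
        cost[0][j] = j          # inserting all hypothesis tokens
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # match/sub
                cost[i - 1][j] + 1,                               # deletion
                cost[i][j - 1] + 1,                               # insertion
            )
    ops, i, j = [], n, m
    while i > 0 or j > 0:       # backtrace to recover the alignment labels
        if i > 0 and j > 0 and \
                cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append('C' if ref[i - 1] == hyp[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append('D')
            i -= 1
        else:
            ops.append('I')
            j -= 1
    return ops[::-1]

# Pass 1: word-level alignment flags two substitutions for a homophone pair.
word_ops = levenshtein_align("i scream".split(), "ice cream".split())
print(word_ops)  # ['S', 'S']

# Pass 2: re-align the error span at the sub-word level. Most units match,
# which reveals a one-to-many homophonic confusion rather than two
# unrelated substitution errors.
char_ops = levenshtein_align(list("iscream"), list("icecream"))
print(sum(op != 'C' for op in char_ops))  # 2 edits over 7 reference units
```

The second pass is what allows the word alignment labels in the error region to be corrected: a span whose sub-word units mostly align is relabeled as a homophonic one-to-many error instead of independent substitutions.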
