Speech Recognition System Combination for Machine Translation

The majority of state-of-the-art speech recognition systems make use of system combination. The combination approaches adopted have traditionally been tuned to minimising word error rates (WERs). In recent years there has been a growing interest in taking the output from speech recognition systems in one language and translating it into another. This paper investigates the use of cross-site combination approaches in terms of both WER and impact on translation performance. In addition, the stages involved in modifying the output from a speech-to-text (STT) system to be suitable for translation are described. Two source languages, Mandarin and Arabic, are recognised and then translated using a phrase-based statistical machine translation system into English. Performance of individual systems and cross-site combination using cross-adaptation and ROVER are given. Results show that the best STT combination scheme in terms of WER is not necessarily the most appropriate when translating speech.

[1]  Jean-Luc Gauvain,et al.  The LIMSI Broadcast News transcription system , 2002, Speech Commun..

[2]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[3]  Dongxin Xu,et al.  The BBN Mandarin broadcast news transcription system , 2005, INTERSPEECH.

[4]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[5]  Ralph Weischedel,et al.  A STUDY OF TRANSLATION ERROR RATE WITH TARGETED HUMAN ANNOTATION , 2005 .

[6]  Sherif Abdou,et al.  The BBN RT04 English broadcast news transcription system , 2005, INTERSPEECH.

[7]  Matthew G. Snover,et al.  A Study of Translation Edit Rate with Targeted Human Annotation , 2006, AMTA.

[8]  Richard M. Schwartz,et al.  Advances in transcription of broadcast news and conversational telephone speech within the combined EARS BBN/LIMSI system , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[9]  Jonathan G. Fiscus,et al.  A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER) , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[10]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[11]  Andreas Stolcke,et al.  Finding consensus among words: lattice-based word error minimization , 1999, EUROSPEECH.

[12]  Mark J. F. Gales,et al.  The Cu-Htk Mandarin Broadcast News Transcription System , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[13]  Gunnar Evermann,et al.  Posterior probability decoding, confidence estimation and system combination , 2000 .

[14]  Andreas Stolcke,et al.  Automatic linguistic segmentation of conversational speech , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.