Improving speech translation with automatic boundary prediction

This paper investigates the influence of automatic sentence boundary and sub-sentence punctuation prediction on machine translation (MT) of automatically recognized speech. We use prosodic and lexical cues to determine sentence boundaries, and successfully combine two complementary approaches to sentence boundary prediction. We also introduce a new feature for segmentation prediction that directly considers the assumptions of the phrase translation model. In addition, we show how automatically predicted commas can be used to constrain reordering in MT search. We evaluate the presented methods using a state-of-the-art phrase-based statistical MT system on two large vocabulary tasks. We find that careful optimization of the segmentation parameters directly for translation quality improves the translation results in comparison to independent optimization for segmentation quality of the predicted source language sentence boundaries.

[1]  Yoram Singer,et al.  BoosTexter: A Boosting-based System for Text Categorization , 2000, Machine Learning.

[2]  Gökhan Tür,et al.  Prosody-based automatic segmentation of speech into sentences and topics , 2000, Speech Commun..

[3]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[4]  Matthew G. Snover,et al.  A Study of Translation Edit Rate with Targeted Human Annotation , 2006, AMTA.

[5]  Hermann Ney,et al.  Evaluating Machine Translation Output with Automatic Sentence Segmentation , 2005, IWSLT.

[6]  Ralph Weischedel,et al.  A STUDY OF TRANSLATION ERROR RATE WITH TARGETED HUMAN ANNOTATION , 2005 .

[7]  Hermann Ney,et al.  The RWTH statistical machine translation system for the IWSLT 2006 evaluation , 2006, IWSLT.

[8]  Dilek Z. Hakkani-Tür,et al.  The ICSI+ multilingual sentence segmentation system , 2006, INTERSPEECH.

[9]  Dilek Z. Hakkani-Tür,et al.  Entropy Based Classifier Combination for Sentence Segmentation , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[10]  Hermann Ney,et al.  Automatic sentence segmentation and punctuation prediction for spoken language translation , 2006, IWSLT.

[11]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[12]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[13]  Hermann Ney,et al.  Discriminative Reordering Models for Statistical Machine Translation , 2006, WMT@HLT-NAACL.