DFKI System Combination with Sentence Ranking at ML4HMT-2011

We present a pilot study on a Hybrid Machine Translation system that takes advantage of multilateral system-specific metadata provided as part of the shared task. The proposed solution offers a machine learning approach, resulting in a selection mechanism able to learn and rank system outputs on the sentence level, based on their quality. For training, due to the lack of human annotations, word-level Levenshtein distance was used as a quality indicator, and a rich set of sentence features was extracted and selected from the dataset. Three classification algo
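The use of word-level Levenshtein distance as a quality indicator can be illustrated with a minimal sketch. The abstract does not say how the distance was computed or turned into training labels, so the following assumes each system output is compared against a reference translation after whitespace tokenization, and that a smaller distance means a better rank. The function names `word_levenshtein` and `rank_outputs` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def word_levenshtein(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Edit distance over tokens; insertion, deletion and substitution cost 1 each."""
    # prev[j] holds the distance between the hypothesis prefix seen so far
    # and the first j reference tokens.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete hypothesis token
                            curr[j - 1] + 1,      # insert reference token
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def rank_outputs(outputs: List[str], reference: str) -> List[int]:
    """Return system indices ordered from smallest to largest distance.

    Assumption: whitespace tokenization and a single reference translation.
    Ties keep the original system order because sorted() is stable.
    """
    ref_tokens = reference.split()
    distances = [word_levenshtein(o.split(), ref_tokens) for o in outputs]
    return sorted(range(len(outputs)), key=lambda k: distances[k])


if __name__ == "__main__":
    systems = [
        "the cat sat on the mat",
        "a cat sat on mat",
        "the cat sat on a mat",
    ]
    print(rank_outputs(systems, "the cat sat on the mat"))
```

A ranking produced this way could serve as the training signal for a sentence-level selection model, with the extracted sentence features as inputs. That pairing is an illustration of the idea, not the paper's exact setup.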
