DCU-Symantec Submission for the WMT 2012 Quality Estimation Task

This paper describes the features and machine learning methods used by Dublin City University (DCU) and Symantec for the WMT 2012 quality estimation task. Two feature sets are proposed: one constrained, i.e. respecting the data limitations suggested by the workshop organisers, and one unconstrained, i.e. using data, or tools trained on data, not provided by the workshop organisers. In total, more than 300 features were extracted and used to train classifiers to predict the translation quality of unseen data. In this paper, we focus on a subset of our feature set that we consider relatively novel: features based on a topic model built using Latent Dirichlet Allocation (LDA), and features based on source- and target-language syntax extracted using part-of-speech (POS) taggers and parsers. We evaluate nine feature combinations using four classification-based and four regression-based machine learning techniques.
