MTWatch: A Tool for the Analysis of Noisy Parallel Data

State-of-the-art statistical machine translation (SMT) techniques require good-quality parallel data to build a translation model. The availability of large parallel corpora has increased rapidly over the past decade. However, these newly developed parallel data often contain significant noise. In this paper, we describe our approach to identifying good-quality parallel sentence pairs in noisy parallel data. We use 10 different features within a Support Vector Machine (SVM)-based model for our classification task. We report reasonably good classification accuracy and a positive effect on overall MT accuracy.
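To make the classification setup concrete, the following is a minimal sketch of feature-based filtering of sentence pairs with a linear SVM. It is an illustration under stated assumptions, not the system described here: the abstract does not list the ten features, so the two features used below (`len_ratio` and a digit-agreement flag) are hypothetical stand-ins, the toy sentence pairs are invented, and the small pure-Python trainer (batch subgradient descent on the L2-regularised hinge loss) replaces whatever SVM implementation and kernel the authors actually used.

```python
# Toy sketch: label sentence pairs as good (+1) or noisy (-1) with a linear SVM.
# Features, data, and hyperparameters are illustrative assumptions only.

def pair_features(src, tgt):
    """Return [len_ratio, digit_match, bias] for one sentence pair."""
    s, t = src.split(), tgt.split()
    # Length ratio in [0, 1]: badly misaligned pairs tend to score low.
    len_ratio = min(len(s), len(t)) / max(len(s), len(t), 1)
    # 1.0 when both sides carry the same numerals, else 0.0.
    same_digits = sorted(w for w in s if w.isdigit()) == sorted(w for w in t if w.isdigit())
    return [len_ratio, 1.0 if same_digits else 0.0, 1.0]  # trailing 1.0 feeds the bias weight


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=300):
    """Batch subgradient descent on the L2-regularised hinge loss.

    The bias is stored as the last weight and is regularised with the rest.
    """
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [lam * wj for wj in w]  # gradient of the regulariser
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:  # only margin violators contribute to the hinge subgradient
                for j, xj in enumerate(xi):
                    grad[j] -= yi * xj / n
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w


def predict(w, src, tgt):
    """Return +1 for a pair judged good, -1 for a pair judged noisy."""
    score = sum(wj * xj for wj, xj in zip(w, pair_features(src, tgt)))
    return 1 if score > 0 else -1


# Invented English-German examples: three clean pairs, three noisy ones.
PAIRS = [
    ("the house is small", "das haus ist klein"),
    ("he paid 20 euros", "er zahlte 20 euro"),
    ("we meet at 5 pm today", "wir treffen uns heute um 5 uhr"),
    ("click here", "wir danken ihnen fuer ihre 3 fragen heute"),
    ("yes", "er zahlte 20 euro fuer das buch"),
    ("page 3 of 10", "wir treffen uns heute abend mit allen freunden in der stadt"),
]
LABELS = [1, 1, 1, -1, -1, -1]

w = train_linear_svm([pair_features(s, t) for s, t in PAIRS], LABELS)
kept = [(s, t) for s, t in PAIRS if predict(w, s, t) == 1]
```

In a filtering pipeline of this kind, only the pairs in `kept` would be passed on to translation-model training. A real system would need many more labelled examples, a richer feature set, and held-out evaluation before the classifier's decisions could be trusted.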
