Automatic Detection of Machine Translated Text and Translation Quality Estimation

We show that it is possible to automatically detect machine translated text at sentence level from monolingual corpora, using text classification methods. We show further that the accuracy with which a learned classifier can detect text as machine translated is strongly correlated with the translation quality of the machine translation system that generated it. Finally, we offer a generic machine translation quality estimation technique based on this approach, which does not require reference sentences.

[1]  Chris Callison-Burch,et al.  Open Source Toolkit for Statistical Machine Translation: Factored Translation Models and Lattice Decoding , 2006 .

[2]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[3]  Qiming Chen,et al.  PrefixSpan,: mining sequential patterns efficiently by prefix-projected pattern growth , 2001, Proceedings 17th International Conference on Data Engineering.

[4]  Diana Inkpen,et al.  Identification of Translationese: A Machine Learning Approach , 2010, CICLing.

[5]  Dmitriy Fradkin,et al.  Bayesian Multinomial Logistic Regression for Author Identification , 2005, AIP Conference Proceedings.

[6]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[7]  Silvia Bernardini,et al.  A New Approach to the Study of Translationese: Machine-learning the Difference between Original and Translated Text , 2005, Lit. Linguistic Comput..

[8]  Chris Quirk,et al.  Gappy Phrasal Alignment By Agreement , 2011, ACL.

[9]  S. Cessie,et al.  Ridge Estimators in Logistic Regression , 1992 .

[10]  Kenneth Heafield,et al.  Efficient Language Modeling Algorithms with Applications to Statistical Machine Translation , 2013 .

[11]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[12]  Ming Zhou,et al.  Machine Translation Detection from Monolingual Web-Text , 2013, ACL.

[13]  James W. Pennebaker,et al.  Linguistic Inquiry and Word Count (LIWC2007) , 2007 .

[14]  Shuly Wintner,et al.  On the features of translationese , 2015, Digit. Scholarsh. Humanit..

[15]  Miles Osborne,et al.  Statistical Machine Translation , 2010, Encyclopedia of Machine Learning and Data Mining.

[16]  Загоровская Ольга Владимировна,et al.  Исследование влияния пола и психологических характеристик автора на количественные параметры его текста с использованием программы Linguistic Inquiry and Word Count , 2015 .

[17]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[18]  George R. Doddington,et al.  Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics , 2002 .

[19]  Alon Lavie,et al.  Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems , 2011, WMT@EMNLP.

[20]  Ralph Weischedel,et al.  A STUDY OF TRANSLATION ERROR RATE WITH TARGETED HUMAN ANNOTATION , 2005 .

[21]  S. Blum,et al.  UNIVERSALS OF LEXICAL SIMPLIFICATION , 1978 .

[22]  Lucia Specia,et al.  QuEst - A translation quality estimation framework , 2013, ACL.

[23]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[24]  S. Sathiya Keerthi,et al.  Improvements to Platt's SMO Algorithm for SVM Classifier Design , 2001, Neural Computation.

[25]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[26]  Dan Klein,et al.  Improved Inference for Unlexicalized Parsing , 2007, NAACL.

[27]  Matthew G. Snover,et al.  A Study of Translation Edit Rate with Targeted Human Annotation , 2006, AMTA.

[28]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments , 2007, WMT@ACL.

[29]  Philipp Koehn,et al.  Findings of the 2014 Workshop on Statistical Machine Translation , 2014, WMT@ACL.

[30]  Hans van Halteren,et al.  Source Language Markers in EUROPARL Translations , 2008, COLING.

[31]  Philipp Koehn,et al.  Findings of the 2013 Workshop on Statistical Machine Translation , 2013, WMT@ACL.

[32]  Philipp Koehn,et al.  Findings of the 2012 Workshop on Statistical Machine Translation , 2012, WMT@NAACL-HLT.

[33]  Mona Baker,et al.  Text and technology : in honour of John Sinclair , 1993 .

[34]  Moshe Koppel,et al.  Translationese and Its Dialects , 2011, ACL.

[35]  Marius Popescu,et al.  Studying Translationese at the Character Level , 2011, RANLP.

[36]  Martin Gellerstam,et al.  Translationese in Swedish novels translated from English , 1986 .

[37]  Dan Klein,et al.  Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network , 2003, NAACL.

[38]  Mona Baker,et al.  'Corpus Linguistics and Translation Studies: Implications and Applications' , 1993 .

[39]  Diana Inkpen,et al.  Searching for Poor Quality Machine Translated Text: Learning the Difference between Human Writing and Machine Translations , 2012, Canadian Conference on AI.

[40]  Cyril Goutte,et al.  Automatic Detection of Translated Text and its Impact on Machine Translation , 2009, MTSUMMIT.

[41]  Matt Post,et al.  Joshua 5.0: Sparser, Better, Faster, Server , 2013, WMT@ACL.

[42]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[43]  U. Germann Aligned Hansards of the 36th Parliament of Canada , 2001 .