Investigating Genre and Method Variation in Translation Using Text Classification

In this paper, we propose the use of automatic text classification methods to analyse variation in English-German translations from both a quantitative and a qualitative perspective. The experiments are carried out in two steps: we train classifiers to (1) discriminate between different genres (fiction, political essays, etc.) and (2) identify the translation method (machine vs. human). Using semi-delexicalized models that exclude all nouns, we report results of up to 60.5% F-measure in distinguishing human from machine translations and 45.4% in discriminating between seven different genres. Beyond the classification performance itself, we argue that text classification methods can bring out the discriminative features of different variables (genres and translation methods), thus enabling researchers to investigate the properties of each in more detail.
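As an illustration of what a semi-delexicalized representation excluding nouns might look like, the following minimal sketch replaces every noun in a POS-tagged sentence with a placeholder and counts bigrams over the result. It is not the authors' implementation: the STTS-style tags (`NN`, `NE`), the `*NOUN*` placeholder, the lowercasing step, the choice of bigrams, and the function names `delexicalize` and `ngram_counts` are all assumptions made for the example.

```python
from collections import Counter

# Assumption: STTS-style noun tags (NN = common noun, NE = proper noun).
NOUN_TAGS = {"NN", "NE"}
# Assumption: hypothetical placeholder symbol standing in for every noun.
PLACEHOLDER = "*NOUN*"


def delexicalize(tagged_tokens):
    """Replace each noun with a placeholder; lowercase all other tokens."""
    return [PLACEHOLDER if tag in NOUN_TAGS else word.lower()
            for word, tag in tagged_tokens]


def ngram_counts(tokens, n=2):
    """Count word n-grams over a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hand-tagged toy sentence; a real pipeline would use a POS tagger.
sentence = [("Die", "ART"), ("Regierung", "NN"), ("plant", "VVFIN"),
            ("eine", "ART"), ("Reform", "NN"), (".", "$.")]

tokens = delexicalize(sentence)
features = ngram_counts(tokens, n=2)
print(tokens)
print(features)
```

Feature counts of this kind could then be passed to any standard classifier. Removing nouns is intended to reduce the influence of topic-specific vocabulary, so that the classifier has to rely more on structural and functional cues.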
