Linguistic Features of Genre and Method Variation in Translation: A Computational Perspective

In this paper we describe the use of text classification methods to investigate genre and method variation in an English-German translation corpus. For this purpose we represent texts with linguistically motivated features: part-of-speech tags arranged in bigrams, trigrams, and 4-grams. The classifier used is a Bayesian classifier with Laplace smoothing. We use the output of the classifiers to carry out an extensive feature analysis of the main differences between genres and methods of translation.
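
The setup described above can be illustrated with a minimal sketch. It assumes a multinomial Naive Bayes model (the abstract only says "Bayesian classifier with Laplace smoothing"), add-one smoothing by default, and texts that are already part-of-speech tagged. The tag sequences, the labels `genre_a` and `genre_b`, and all function and class names are hypothetical and are not taken from the paper or its corpus. The `informative_features` method shows one possible way to read features off a trained model; it is not necessarily the analysis the authors performed.

```python
import math
from collections import Counter, defaultdict


def pos_ngrams(tags, sizes=(2, 3, 4)):
    """Return POS n-grams of the given sizes as underscore-joined strings."""
    feats = []
    for n in sizes:
        for i in range(len(tags) - n + 1):
            feats.append("_".join(tags[i:i + n]))
    return feats


class PosNgramNaiveBayes:
    """Multinomial Naive Bayes over POS n-gram counts (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # alpha=1.0 gives Laplace (add-one) smoothing

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for tags, label in zip(docs, labels):
            feats = pos_ngrams(tags)
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def _log_likelihood(self, feat, label):
        # Smoothed estimate of log P(feature | class).
        counts = self.feat_counts[label]
        denom = sum(counts.values()) + self.alpha * len(self.vocab)
        return math.log((counts[feat] + self.alpha) / denom)

    def log_scores(self, tags):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            for feat in pos_ngrams(tags):
                if feat in self.vocab:  # skip n-grams unseen in training
                    score += self._log_likelihood(feat, label)
            scores[label] = score
        return scores

    def predict(self, tags):
        scores = self.log_scores(tags)
        return max(scores, key=scores.get)

    def informative_features(self, label_a, label_b, k=5):
        """Rank n-grams by log P(f | label_a) - log P(f | label_b)."""
        ratio = {
            f: self._log_likelihood(f, label_a) - self._log_likelihood(f, label_b)
            for f in self.vocab
        }
        return sorted(ratio, key=ratio.get, reverse=True)[:k]


# Hypothetical toy data: each document is a list of POS tags.
train_docs = [
    ["PPER", "VVFIN", "ART", "NN", "$."],
    ["PPER", "VVFIN", "ADV", "ART", "NN", "$."],
    ["ART", "ADJA", "NN", "VVFIN", "APPR", "ART", "NN", "$."],
    ["ART", "NN", "VAFIN", "ART", "ADJA", "NN", "$."],
]
train_labels = ["genre_a", "genre_a", "genre_b", "genre_b"]

model = PosNgramNaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["PPER", "VVFIN", "ART", "NN", "$."])
top_for_a = model.informative_features("genre_a", "genre_b", k=3)
```

The same class could be trained with translation-method labels instead of genre labels; only `train_labels` would change. Working in log space avoids numerical underflow when a text contains many n-grams.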
