Detecting Syntactic Features of Translated Chinese

We present a machine learning approach to distinguishing texts translated into Chinese (by humans) from texts originally written in Chinese, focusing on a wide range of syntactic features. Using Support Vector Machines (SVMs) as the classifier on a genre-balanced corpus from Chinese translation studies, we find that constituent parse trees and dependency triples, used as features without lexical information, perform very well on the task: they reach an F-measure above 90%, close to the results of lexical n-gram features, without the risk of learning topic information rather than translation features. We therefore claim that syntactic features alone can accurately distinguish translated from original Chinese. Among other structures, translated Chinese exhibits increased use of determiners, subject-position pronouns, NP + 'de' as NP modifiers, and multiple NPs or VPs conjoined by a Chinese-specific punctuation mark. We also interpret the syntactic features with reference to previous translation studies in Chinese, particularly regarding the usage of pronouns.
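
As a rough illustration of what "dependency triples without lexical information" could look like, the sketch below counts (head POS, relation, dependent POS) triples for one parsed sentence. The `Token` layout, the function name `delexicalized_triples`, the toy sentence, and its tags and relation labels are all assumptions made for this example; the exact feature encoding used in the paper is not specified in the abstract.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    form: str  # surface word; deliberately not used in the features
    pos: str   # part-of-speech tag
    head: int  # 1-based index of the head token, 0 for the root
    rel: str   # dependency relation linking this token to its head


def delexicalized_triples(sentence):
    """Count (head POS, relation, dependent POS) triples in one sentence."""
    counts = Counter()
    for tok in sentence:
        # The root has no head token, so use a placeholder tag for it.
        head_pos = "ROOT" if tok.head == 0 else sentence[tok.head - 1].pos
        counts[(head_pos, tok.rel, tok.pos)] += 1
    return counts


# Hypothetical toy parse; tags and relation labels are illustrative only.
toy_sentence = [
    Token("他", "PN", 2, "nsubj"),
    Token("喜欢", "VV", 0, "root"),
    Token("这", "DT", 5, "det"),
    Token("本", "M", 3, "clf"),
    Token("书", "NN", 2, "dobj"),
]

features = delexicalized_triples(toy_sentence)
for triple, count in sorted(features.items()):
    print(triple, count)
```

Because the word forms never enter the feature keys, counts like these can be fed to a classifier without exposing topic vocabulary, which is the motivation the abstract gives for preferring syntactic over lexical n-gram features.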
