Cross-Language Authorship Attribution

This paper presents a novel task of cross-language authorship attribution (CLAA), an extension of the authorship attribution task to multilingual settings: given data labelled with authors in language X, the objective is to determine the author of a document written in language Y, where X is different from Y. We propose a number of cross-language stylometric features for CLAA, such as those based on sentiment and emotional markers. We also explore an approach based on machine translation (MT) that uses both lexical and cross-language features. We show experimentally that MT can serve as a starting point for CLAA, since it allows good attribution accuracy to be achieved. The cross-language features provide acceptable accuracy when used jointly with MT, though they do not outperform lexical features.
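
To make the MT-based setting concrete, the following is a minimal sketch under loud assumptions, not the method evaluated in the paper. The word-for-word lookup table `TOY_MT` stands in for a real MT system, the word-frequency profiles stand in for lexical features, and nearest-profile matching by cosine similarity stands in for the classifier. All names, texts, and authors here are hypothetical.

```python
from collections import Counter
import math

# Hypothetical stand-in for an MT system: a word-for-word lookup
# from language Y (Spanish here) into language X (English here).
TOY_MT = {
    "el": "the", "mar": "sea", "era": "was", "oscuro": "dark",
    "y": "and", "frio": "cold", "dia": "day", "feliz": "happy",
    "brillante": "bright",
}

def translate(text):
    """Map each word through the toy table; unknown words pass through."""
    return " ".join(TOY_MT.get(w, w) for w in text.lower().split())

def profile(text):
    """Lexical features: a bag of lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count profiles."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def attribute(doc_y, train_x):
    """Translate a language-Y document into X, then pick the closest author."""
    translated = profile(translate(doc_y))
    return max(train_x, key=lambda author: cosine(translated, train_x[author]))

# Author-labelled training data in language X (invented toy texts).
TRAIN = {
    "author_a": profile("the sea was dark and the sea was cold"),
    "author_b": profile("the day was happy and the sun was bright"),
}

print(attribute("el mar era oscuro y frio", TRAIN))
```

The point of the sketch is only the shape of the pipeline: labelled data exists in language X alone, and the language-Y document is brought into X before any lexical comparison is made.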
