Domain Independent Authorship Attribution without Domain Adaptation

Automatic authorship attribution is, by its nature, far more useful when it is domain independent (i.e., independent of topic and/or genre), because many real-world problems that require authorship attribution may not have in-domain training data readily available. However, most previous work based on machine learning techniques has focused only on in-domain text. In this paper, we present a comprehensive evaluation of various stylometric techniques for cross-domain authorship attribution. From experiments based on the Project Gutenberg book archive, we find that extremely simple techniques based on stopwords are surprisingly robust against domain change, essentially eliminating the need for domain adaptation when a large amount of data is supplied.
