Using Relative Entropy for Authorship Attribution

Authorship attribution is the task of deciding who wrote a particular document. Several attribution approaches have been proposed in recent research, but none is particularly satisfactory: some are ad hoc, and most have defects in scalability, effectiveness, and efficiency. In this paper, we propose a principled approach, motivated by information theory, that identifies authors based on elements of writing style. We use the Kullback-Leibler divergence (relative entropy), a measure of how much two probability distributions differ, and explore several approaches to tokenizing documents to extract style markers. We evaluate the approach on several data collections. We find that it is as effective as the best existing attribution methods for two-class attribution and superior for multi-class attribution. It also has lower computational cost and is cheaper to train. Finally, our results suggest that the approach is a promising alternative for other categorization problems.
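
To make the idea concrete, the following is a minimal sketch of attribution by Kullback-Leibler divergence, not the method as implemented in the paper. It assumes a whitespace tokenizer and add-alpha (Laplace) smoothing, neither of which is specified in the abstract, and all function names (`tokenize`, `smoothed_dist`, `kl_divergence`, `attribute`) are hypothetical. A document is assigned to the author whose token distribution has the smallest divergence from the document's distribution.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for style markers.
    return text.lower().split()


def smoothed_dist(tokens, vocab, alpha=1.0):
    # Assumption: add-alpha smoothing so every vocabulary item has
    # non-zero probability and the divergence stays finite.
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    # D(p || q) = sum_w p(w) * log2(p(w) / q(w)), measured in bits.
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)


def attribute(doc, author_texts, alpha=1.0):
    # Build a shared vocabulary over the document and all author samples.
    doc_tokens = tokenize(doc)
    author_tokens = {a: tokenize(t) for a, t in author_texts.items()}
    vocab = set(doc_tokens)
    for toks in author_tokens.values():
        vocab.update(toks)

    # Score each author by divergence from the document's distribution.
    p = smoothed_dist(doc_tokens, vocab, alpha)
    scores = {
        a: kl_divergence(p, smoothed_dist(toks, vocab, alpha))
        for a, toks in author_tokens.items()
    }
    # The author with the smallest divergence is the predicted author.
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    samples = {
        "a": "the cat sat on the mat the cat",
        "b": "whereas hence thus whereas hence",
    }
    best, scores = attribute("the cat on the mat", samples)
    print(best, scores)
```

Because each candidate author needs only one token distribution, adding an author in this sketch means building one more distribution, with no retraining of the others.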
