Authorship Attribution with Latent Dirichlet Allocation

The problem of authorship attribution, that is, attributing texts to their original authors, has been an active research area since the end of the 19th century and has attracted increased interest in the last decade. Most work on authorship attribution focuses on scenarios with only a few candidate authors, whereas recently studied cases with tens to thousands of candidate authors have proved much more challenging. In this paper, we propose ways of employing Latent Dirichlet Allocation in authorship attribution. We show that our approach yields state-of-the-art performance for both few and many candidate authors, provided that these authors wrote enough text to be modelled effectively.
