Authorship Attribution with Author-aware Topic Models

Authorship attribution is the task of identifying the authors of anonymous texts. Building on our earlier finding that the Latent Dirichlet Allocation (LDA) topic model can improve authorship attribution accuracy, we show that the previously suggested Author-Topic (AT) model outperforms LDA in scenarios with many authors. In addition, we define a model that combines LDA and AT by representing authors and documents over two disjoint topic sets, and show that it outperforms LDA, AT, and support vector machines on datasets with many authors.
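One possible reading of "two disjoint topic sets" can be sketched as a toy generative process: each token of a single-author document is drawn either from a document topic, as in LDA, or from an author topic, as in AT, and the two topic sets share no topics. This is an illustrative sketch only, not the model as specified in the paper. The function names, the fixed `author_word_ratio`, the symmetric Dirichlet parameter `alpha`, and the toy sizes are all assumptions made for this example.

```python
import random


def sample_dirichlet(rng, alpha, size):
    """Draw a symmetric Dirichlet sample by normalising Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(rng, probs):
    """Draw one index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(rng, author, n_tokens, doc_topic_words,
                      author_topic_words, author_topic_mix,
                      author_word_ratio=0.5, alpha=0.5):
    """Return (source, topic, word_id) triples for one single-author document.

    Assumption: author_word_ratio is a fixed probability that a token comes
    from the author topic set rather than the document topic set.
    """
    # Document-specific mixture over the document topic set (LDA-like part).
    doc_topic_mix = sample_dirichlet(rng, alpha, len(doc_topic_words))
    tokens = []
    for _ in range(n_tokens):
        if rng.random() < author_word_ratio:
            # Author-specific mixture over the author topic set (AT-like part).
            topic = sample_index(rng, author_topic_mix[author])
            word = sample_index(rng, author_topic_words[topic])
            tokens.append(("author", topic, word))
        else:
            topic = sample_index(rng, doc_topic_mix)
            word = sample_index(rng, doc_topic_words[topic])
            tokens.append(("document", topic, word))
    return tokens


# Toy sizes, chosen only for illustration.
VOCAB, DOC_TOPICS, AUTHOR_TOPICS, AUTHORS = 50, 4, 3, 5
rng = random.Random(0)
doc_topic_words = [sample_dirichlet(rng, 0.5, VOCAB) for _ in range(DOC_TOPICS)]
author_topic_words = [sample_dirichlet(rng, 0.5, VOCAB) for _ in range(AUTHOR_TOPICS)]
author_topic_mix = [sample_dirichlet(rng, 0.5, AUTHOR_TOPICS) for _ in range(AUTHORS)]

tokens = generate_document(rng, author=2, n_tokens=20,
                           doc_topic_words=doc_topic_words,
                           author_topic_words=author_topic_words,
                           author_topic_mix=author_topic_mix)
```

Under these assumptions, setting `author_word_ratio` to 0 reduces the sketch to an LDA-style document, and setting it to 1 reduces it to an AT-style document with a single author. Attribution would additionally require inferring the topic assignments from observed text, which this sketch does not attempt.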
