Detecting authorship deception: a supervised machine learning approach using author writeprints

We describe a new supervised machine learning approach for detecting authorship deception, a specific type of authorship attribution task particularly relevant for cybercrime forensic investigations, and demonstrate its validity on two case studies drawn from realistic online data sets. The core of our approach involves identifying uncharacteristic behavior for an author, based on a writeprint extracted from unstructured text samples of the author's writing. The writeprints used here involve stylometric features and content features derived from topic models, an unsupervised approach for identifying relevant keywords that relate to the content areas of a document. One innovation of our approach is to transform the writeprint feature values into a representation that individually balances characteristic and uncharacteristic traits of an author, and we subsequently apply a Sparse Multinomial Logistic Regression classifier to this novel representation. Our method yields high accuracy for authorship deception detection on the two case studies, confirming its utility.
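To make the idea of a writeprint concrete, the following is a minimal sketch and not the paper's method. The feature set (average word length plus a handful of function-word rates), the helper names `writeprint`, `author_profile`, and `deviation`, and the z-score-style deviation measure are all assumptions chosen for illustration. The abstract does not specify the actual features or the transformation that balances characteristic and uncharacteristic traits, and the topic-model features and the Sparse Multinomial Logistic Regression classifier are omitted here.

```python
import re
from statistics import mean, pstdev

# Hypothetical, deliberately tiny function-word list (an assumption, not the paper's).
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "is", "it")


def writeprint(text):
    """Return a small stylometric feature vector for one text sample."""
    words = re.findall(r"[a-z']+", text.lower())
    n = max(len(words), 1)  # avoid division by zero on empty text
    features = [sum(len(w) for w in words) / n]  # average word length
    features += [words.count(fw) / n for fw in FUNCTION_WORDS]  # function-word rates
    return features


def author_profile(samples):
    """Summarize an author's known samples as (mean, std) per feature."""
    vectors = [writeprint(s) for s in samples]
    return [(mean(col), pstdev(col)) for col in zip(*vectors)]


def deviation(text, profile, eps=1e-6):
    """Per-feature distance of a new text from the author's profile.

    Larger values mark features on which the text looks uncharacteristic
    of the author. This z-score-like measure is a placeholder only.
    """
    return [abs(x - mu) / (sigma + eps)
            for x, (mu, sigma) in zip(writeprint(text), profile)]
```

In a fuller pipeline, vectors like those returned by `deviation` would be the input to a sparse classifier. Any such pipeline built on this sketch would be an approximation of the approach described above, not a reproduction of it.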
