Measuring Differentiability: Unmasking Pseudonymous Authors

In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine whether given long texts were written by this author. We present a new learning-based method for measuring the "depth of difference" between two example sets and offer evidence that this method solves the authorship verification problem with very high accuracy. The underlying idea is to measure how quickly the accuracy of learned models degrades as the best features are iteratively dropped from the learning process.
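The iterative feature-dropping idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, since the abstract names neither a learner nor a feature-ranking rule: a nearest-centroid classifier stands in for the learned model, the "best" features are taken to be those with the largest gap between class means, and accuracy is estimated by leave-one-out. The function names (`unmasking_curve`, `loo_accuracy`, `centroid`) are hypothetical.

```python
from statistics import mean


def centroid(rows, features):
    """Per-feature mean of `rows`, restricted to the given feature indices."""
    return {f: mean(r[f] for r in rows) for f in features}


def loo_accuracy(a_rows, b_rows, features):
    """Leave-one-out accuracy of a nearest-centroid classifier (assumed learner)."""
    samples = [(r, 0) for r in a_rows] + [(r, 1) for r in b_rows]
    correct = 0
    for i, (row, label) in enumerate(samples):
        # Train on every sample except the held-out one.
        train = [s for j, s in enumerate(samples) if j != i]
        cents = [
            centroid([r for r, lab in train if lab == c], features)
            for c in (0, 1)
        ]
        dists = [sum((row[f] - cent[f]) ** 2 for f in features) for cent in cents]
        pred = 0 if dists[0] <= dists[1] else 1
        correct += pred == label
    return correct / len(samples)


def unmasking_curve(a_rows, b_rows, n_rounds, drop_per_round):
    """Record accuracy each round, then drop the most discriminating features.

    Ranking by absolute difference of class means is an illustrative choice,
    not something stated in the abstract.
    """
    features = list(range(len(a_rows[0])))
    curve = []
    for _ in range(n_rounds):
        if not features:
            break
        curve.append(loo_accuracy(a_rows, b_rows, features))
        ca, cb = centroid(a_rows, features), centroid(b_rows, features)
        ranked = sorted(features, key=lambda f: abs(ca[f] - cb[f]), reverse=True)
        dropped = set(ranked[:drop_per_round])
        features = [f for f in features if f not in dropped]
    return curve
```

Under these assumptions, two example sets that differ in only a few features produce a curve that falls sharply once those features are removed, while sets that differ in many features keep a high accuracy across rounds. The rate of that fall is what the abstract calls the depth of difference.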
