Author Clustering Using SPATIUM

This paper presents the author clustering problem and compares it to related authorship attribution questions. The proposed model is based on a distance measure called Spatium, derived from the Canberra measure (a weighted version of the L1 norm). The selected features are the 200 most frequent words and punctuation symbols. An evaluation methodology is presented, with test collections extracted from the PAN CLEF 2016 evaluation campaign. We also consider two additional corpora that reflect the literature domain more closely. Covering four languages, the evaluation shows high precision and F1 scores for all 20 test collections. A more detailed analysis explains some of the failures of the Spatium model.
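To make the distance idea concrete, the following is a minimal sketch of a Canberra-style distance computed over the relative frequencies of the most frequent terms. It is not the paper's Spatium implementation: the tokenizer, the function names, and the choice to take the k most frequent terms from the first text are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercased word tokens plus single punctuation symbols.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def relative_frequencies(tokens):
    # Map each term to its share of all tokens in the text.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def canberra_distance(text_a, text_b, k=200):
    """Canberra-style distance over the k most frequent terms of text_a.

    Assumption: the feature set comes from text_a only, so the measure
    is not symmetric. Each term contributes |pa - pb| / (pa + pb),
    i.e. an L1 difference weighted by the size of the two frequencies.
    """
    freq_a = relative_frequencies(tokenize(text_a))
    freq_b = relative_frequencies(tokenize(text_b))
    # Sort by decreasing frequency, breaking ties alphabetically.
    top_terms = sorted(freq_a, key=lambda t: (-freq_a[t], t))[:k]
    distance = 0.0
    for term in top_terms:
        pa = freq_a[term]            # always > 0 for terms taken from text_a
        pb = freq_b.get(term, 0.0)   # 0 when the term is absent from text_b
        distance += abs(pa - pb) / (pa + pb)
    return distance
```

Under these assumptions, two identical texts are at distance 0, and a term that appears in only one text contributes the maximum value of 1. A smaller distance suggests closer stylistic profiles, which is the kind of signal a clustering step could use to group texts by author.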
