Author Clustering with an Adaptive Threshold

This paper describes and evaluates an unsupervised author clustering model called Spatium. The proposed strategy can be readily adapted to different natural languages (such as Dutch, English, and Greek) and applied to different text genres (newspaper articles, reviews, excerpts of novels, etc.). As features, we suggest using the m most frequent terms of each text (isolated words and punctuation symbols), with m set to at most 200. Applying a distance measure, we determine whether there is enough evidence that two texts were written by the same author. The evaluations are based on six test collections from the PAN Author Clustering task at CLEF 2016. A more detailed analysis shows the strengths of our approach, identifies its problems, and suggests reasons for some of the failures of the Spatium model.
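The pipeline outlined in the abstract (most frequent terms as features, a distance between two texts, and a decision on whether the evidence suffices to link them) might be sketched as follows. This is only an illustration under stated assumptions, not the Spatium implementation: the abstract does not specify the tokenizer, the distance measure, or how the threshold is adapted, so the regular-expression tokenizer, the L1 distance over relative frequencies, the fixed `threshold` parameter, and all function names are hypothetical choices made for the example.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Assumption: lowercased words and single punctuation symbols are the terms.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def profile(text, m=200):
    # Relative frequencies of the m most frequent terms of one text.
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    return {term: c / total for term, c in counts.most_common(m)}


def distance(p, q):
    # Assumption: L1 distance over the union of both term sets;
    # the abstract only says "a distance measure".
    return sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in set(p) | set(q))


def cluster(texts, threshold, m=200):
    # Link two texts when their distance falls below the threshold, then
    # return the connected components as clusters (lists of text indices).
    # Assumption: a fixed threshold stands in for the adaptive one.
    profiles = [profile(t, m) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(texts)), 2):
        if distance(profiles[i], profiles[j]) < threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster(texts, threshold)` on a list of strings returns groups of indices, one group per presumed author. Treating links as transitive (connected components) is a design choice of this sketch; a stricter scheme could require every pair in a cluster to fall below the threshold.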
