UniNE at CLEF 2017: Author Clustering

This paper describes and evaluates an effective unsupervised author clustering and authorship linking model called SPATIUM. The suggested strategy can be adapted without any difficulty to different languages (such as Dutch, English, and Greek) in different text genres (e.g., newspaper articles and reviews). As features, we suggest using the m most frequent terms (isolated words and punctuation symbols) or the m most frequent character n-grams of each text. Applying a simple distance measure, we determine whether there is enough indication that two texts were written by the same author. The evaluations are based on 60 training and 120 test problems (PAN AUTHOR CLUSTERING task at CLEF 2017). Using the most frequent terms results in a higher clustering precision, while using the most frequent character n-grams of letters gives a higher clustering recall. An analysis to assess the variability of the performance measures indicates that we have a system working stable independent of the underlying text collection and that our parameter choices did not over-fit to the training data.

[1]  Benno Stein,et al.  Ousting ivory tower research: towards a web framework for providing experiments as a service , 2012, SIGIR '12.

[2]  Robert Tibshirani,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition , 2001, Springer Series in Statistics.

[3]  Julio Gonzalo,et al.  A comparison of extrinsic clustering evaluation metrics based on formal constraints , 2008, Information Retrieval.

[4]  Jacques Savoy,et al.  Author Clustering with an Adaptive Threshold , 2017, CLEF.

[5]  Jacques Savoy,et al.  Estimating the probability of an authorship attribution , 2016, J. Assoc. Inf. Sci. Technol..

[6]  Mirco Kocher UniNE at CLEF 2016: Author Clustering , 2016, CLEF.

[7]  Benno Stein,et al.  Overview of the Author Identification Task at PAN-2017: Style Breach Detection and Author Clustering , 2017, CLEF.

[8]  Jacques Savoy,et al.  Comparative evaluation of term selection functions for authorship attribution , 2015, Digit. Scholarsh. Humanit..

[9]  Benno Stein,et al.  Improving the Reproducibility of PAN's Shared Tasks: - Plagiarism Detection, Author Identification, and Author Profiling , 2014, CLEF.

[10]  John Burrows,et al.  'Delta': a Measure of Stylistic Difference and a Guide to Likely Authorship , 2002, Lit. Linguistic Comput..

[11]  Jacques Savoy,et al.  Distance measures in author profiling , 2017, Information Processing & Management.

[12]  Benno Stein,et al.  Clustering by Authorship Within and Across Documents , 2016, CLEF.

[13]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .