Author Clustering based on Compression-based Dissimilarity Scores

The PAN 2017 Author Clustering task examines two application scenarios: complete author clustering and authorship-link ranking. In the first scenario, one must identify the number k of different authors within a document collection and assign each document to exactly one of the k clusters, where each cluster corresponds to a different author. In the second scenario, one must establish authorship links between documents in a cluster and provide a list of document pairs, ranked according to a confidence score. We present a simple scheme that handles both scenarios. To group the documents by author, we use k-Medoids, where the optimal k is determined by computing silhouettes. To determine links between the documents in each cluster, we apply a predefined compressor together with a dissimilarity measure. The resulting compression-based dissimilarity scores are then used to rank all document pairs. The proposed scheme requires no text preprocessing, feature engineering, or hyperparameter optimization, all of which are often necessary in author clustering and related fields. However, the achieved results indicate that there is room for improvement.
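
The pipeline described above can be illustrated with a minimal sketch. The abstract names neither the compressor nor the dissimilarity measure, so the code below assumes zlib and the Normalized Compression Distance (NCD) as stand-ins. The farthest-first medoid initialization and all function names (`ncd`, `k_medoids`, `choose_k`, `rank_links`) are likewise hypothetical choices made for illustration, not the authors' implementation.

```python
import zlib
from itertools import combinations


def csize(data: bytes) -> int:
    """Compressed size in bytes; zlib is an assumed stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance (assumed dissimilarity measure)."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = csize(bx), csize(by)
    return (csize(bx + by) - min(cx, cy)) / max(cx, cy)


def distance_matrix(texts):
    """Symmetric matrix of pairwise dissimilarities between raw texts."""
    n = len(texts)
    D = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = ncd(texts[i], texts[j])
    return D


def k_medoids(D, k, max_iter=100):
    """Alternating k-medoids on a precomputed matrix; returns cluster labels."""
    n = len(D)
    # Assumed initialization: most central point, then farthest-first.
    medoids = [min(range(n), key=lambda i: sum(D[i]))]
    while len(medoids) < k:
        rest = [i for i in range(n) if i not in medoids]
        medoids.append(max(rest, key=lambda i: min(D[i][m] for m in medoids)))
    labels = []
    for _ in range(max_iter):
        # Assign every document to its nearest medoid.
        labels = [min(range(k), key=lambda c: D[i][medoids[c]]) for i in range(n)]
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                # New medoid: member with the smallest total distance to its cluster.
                updated.append(min(members, key=lambda m: sum(D[m][j] for j in members)))
            else:
                updated.append(medoids[c])
        if updated == medoids:
            break
        medoids = updated
    return labels


def silhouette(D, labels):
    """Mean silhouette width; singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(D[i][j] for j in other) / len(other)
                for key, other in clusters.items() if key != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def choose_k(D):
    """Pick the k in 2..n-1 with the highest mean silhouette width."""
    n = len(D)
    best_k, best_labels, best_score = 1, [0] * n, float("-inf")
    for k in range(2, n):
        labels = k_medoids(D, k)
        score = silhouette(D, labels)
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    return best_k, best_labels


def rank_links(D, labels):
    """Within-cluster document pairs, lowest dissimilarity (strongest link) first."""
    pairs = [(i, j, D[i][j]) for i, j in combinations(range(len(D)), 2)
             if labels[i] == labels[j]]
    return sorted(pairs, key=lambda p: p[2])
```

Given a list of raw document strings `texts`, one would call `D = distance_matrix(texts)`, then `k, labels = choose_k(D)`, then `rank_links(D, labels)`. Because the silhouette is only defined for 2 to n-1 clusters, this sketch cannot select k = 1 or k = n; how the original scheme treats those edge cases is not stated in the abstract.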
