Paper Information - Author Clustering based on Compression-based Dissimilarity Scores


The PAN 2017 Author Clustering task examines two application scenarios: complete author clustering and authorship-link ranking. In the first scenario, one must identify the number k of different authors within a document collection and assign each document to exactly one of the k clusters, where each cluster corresponds to a different author. In the second scenario, one must establish authorship links between documents in a cluster and provide a list of document pairs, ranked according to a confidence score. We present a simple scheme to handle both scenarios. To group the documents by their authors, we use k-Medoids, where the optimal k is determined through the computation of silhouettes. To determine links between the documents in each cluster, we apply a predefined compressor as well as a dissimilarity measure. The resulting compression-based dissimilarity scores are then used to rank all document pairs. The proposed scheme does not require (text-)preprocessing, feature engineering, or hyperparameter optimization, which are often necessary in author clustering and other related fields. However, the achieved results indicate that there is room for improvement.
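The scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the compressor or the dissimilarity measure, so zlib and the normalized compression distance are used here purely as stand-ins. The function names (clen, ncd, dissimilarity_matrix, k_medoids, silhouette, choose_k, rank_links), the naive medoid initialization, and the confidence score 1 - dissimilarity are likewise assumptions made for illustration.

```python
import itertools
import zlib


def clen(text):
    """Compressed size in bytes; zlib is only a stand-in compressor."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x, y):
    """Normalized compression distance (assumed measure, not the paper's)."""
    cx, cy = clen(x), clen(y)
    return (clen(x + y) - min(cx, cy)) / max(cx, cy)


def dissimilarity_matrix(docs):
    """Symmetric matrix; each pair is compressed once, in index order."""
    n = len(docs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        d[i][j] = d[j][i] = ncd(docs[i], docs[j])
    return d


def assign(d, medoids):
    """Label every document with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: d[i][medoids[c]])
            for i in range(len(d))]


def k_medoids(d, k, max_iter=100):
    """Alternating k-medoids; the first k documents seed the medoids."""
    medoids = list(range(k))
    for _ in range(max_iter):
        labels = assign(d, medoids)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid for an empty cluster
                updated.append(medoids[c])
                continue
            updated.append(min(members,
                               key=lambda m: sum(d[m][i] for i in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(d, medoids)


def silhouette(d, labels):
    """Mean silhouette width; singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(d[i][j] for j in own) / (len(own) - 1)
        b = min(sum(d[i][j] for j in other) / len(other)
                for c, other in clusters.items() if c != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def choose_k(d):
    """Try k = 2 .. n-1 and keep the clustering with the best silhouette."""
    n = len(d)
    best_k, best_labels, best_score = 1, [0] * n, float("-inf")
    for k in range(2, n):
        labels = k_medoids(d, k)
        score = silhouette(d, labels)
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    return best_k, best_labels


def rank_links(d, labels):
    """Rank same-cluster pairs by an assumed confidence 1 - dissimilarity."""
    links = [((i, j), max(0.0, 1.0 - d[i][j]))
             for i, j in itertools.combinations(range(len(d)), 2)
             if labels[i] == labels[j]]
    return sorted(links, key=lambda link: link[1], reverse=True)
```

Given a list of document strings docs, calling dissimilarity_matrix(docs), then choose_k on the result, then rank_links on the matrix and the chosen labels walks through both scenarios in order. A full PAM implementation with swap steps and a stronger compressor would likely behave differently; the sketch only shows how the pieces fit together without preprocessing or feature engineering.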

Oren Halvani | Lukas Graner
