Hierarchical Clustering Analysis: The Best-Performing Approach at PAN 2017 Author Clustering Task

The author clustering problem consists of grouping documents written by the same author so that each group corresponds to a different author. We describe our approach to the author clustering task at PAN 2017, which was the best-performing system in that task. Our method performs hierarchical clustering analysis using document features such as typed and untyped character n-grams, word n-grams, and stylometric features. We experimented with two feature representation methods, the log-entropy model and TF-IDF, and tuned minimum frequency thresholds to reduce feature dimensionality. We identified the optimal number of clusters (authors) dynamically for each collection using the Caliński-Harabasz score. The implementation of our system is available as open source (https://github.com/helenpy/clusterPAN2017).
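To make the pipeline concrete, the following is a minimal sketch of the general idea, not the implementation in the linked repository. It assumes untyped character 3-grams only, the standard log-entropy weighting (local weight log(1 + tf), global weight 1 + Σ p log p / log N), average-linkage agglomerative clustering with Euclidean distance, and a minimum frequency threshold of 2. All function names and parameter values here are hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def char_ngrams(text, n=3):
    """Untyped character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def log_entropy_vectors(docs, n=3, min_freq=2):
    """Log-entropy weighted n-gram vectors; needs at least two documents.
    N-grams seen fewer than min_freq times in the collection are dropped."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    total = Counter()
    for c in counts:
        total.update(c)
    vocab = sorted(t for t, f in total.items() if f >= min_freq)
    vectors = [[0.0] * len(vocab) for _ in docs]
    for i, term in enumerate(vocab):
        # Global weight: 1 + sum_j p_j * log(p_j) / log(N), with p_j = tf_j / gf
        entropy = 0.0
        for c in counts:
            if c[term]:
                p = c[term] / total[term]
                entropy += p * math.log(p)
        g = 1.0 + entropy / math.log(len(docs))
        for j, c in enumerate(counts):
            vectors[j][i] = g * math.log(1 + c[term])  # local weight: log(1 + tf)
    return vectors

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(vectors, k):
    """Naive average-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = sum(euclid(vectors[i], vectors[j])
                    for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best  # a < b, so deleting b leaves index a valid
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

def centroid(vectors, idx):
    return [sum(vectors[i][d] for i in idx) / len(idx)
            for d in range(len(vectors[0]))]

def calinski_harabasz(vectors, clusters):
    """Ratio of between-cluster to within-cluster dispersion,
    each divided by its degrees of freedom (defined for 2 <= k <= n - 1)."""
    n, k = len(vectors), len(clusters)
    overall = centroid(vectors, range(n))
    between = within = 0.0
    for c in clusters:
        mu = centroid(vectors, c)
        between += len(c) * euclid(mu, overall) ** 2
        within += sum(euclid(vectors[i], mu) ** 2 for i in c)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))

def best_clustering(vectors):
    """Try every k from 2 to n - 1 and keep the highest-scoring partition."""
    candidates = [agglomerate(vectors, k) for k in range(2, len(vectors))]
    return max(candidates, key=lambda c: calinski_harabasz(vectors, c))
```

Under these assumptions, `best_clustering(log_entropy_vectors(docs))` returns a list of clusters, each a list of document indices, for a collection `docs` of at least three strings. The number of clusters is never passed in; it falls out of the score, which mirrors the per-collection selection described above. The naive merge loop is cubic in the number of documents, so it is only suitable for small collections.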
