Author Clustering using Hierarchical Clustering Analysis

This paper presents our approach to the Author Clustering task at PAN 2017. We performed a hierarchical clustering analysis of several document features: typed and untyped character n-grams and word n-grams. We experimented with two feature representation methods, the log-entropy model and tf-idf, and tuned minimum frequency thresholds to reduce dimensionality. Our system ranked first in both subtasks, author clustering and authorship-link ranking.
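As a rough illustration of the kind of pipeline described above, the following minimal sketch weights untyped character n-gram counts with the standard log-entropy scheme, discards n-grams below a minimum frequency, and groups documents by agglomerative clustering. It is not the system evaluated at PAN 2017: the function names, the use of cosine distance, average linkage, and the fixed distance threshold used to stop merging are all assumptions made for this example, and typed n-grams, word n-grams, and tf-idf are omitted.

```python
import math
from collections import Counter
from itertools import combinations


def char_ngrams(text, n=3):
    """Return the untyped character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def log_entropy_vectors(docs, n=3, min_freq=2):
    """Log-entropy weighted n-gram vectors (hypothetical helper).

    N-grams whose corpus frequency is below min_freq are dropped
    to reduce dimensionality. Requires at least two documents.
    """
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    total = Counter()
    for c in counts:
        total.update(c)
    vocab = sorted(g for g, f in total.items() if f >= min_freq)
    vectors = [[0.0] * len(vocab) for _ in docs]
    for k, gram in enumerate(vocab):
        # Global weight: 1 + sum_j p_j * log(p_j) / log(N), p_j = tf_j / gf.
        entropy = 0.0
        for c in counts:
            if c[gram]:
                p = c[gram] / total[gram]
                entropy += p * math.log(p)
        g = 1.0 + entropy / math.log(len(docs))
        for j, c in enumerate(counts):
            # Local weight: log(1 + tf).
            vectors[j][k] = g * math.log(1 + c[gram])
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def cluster(vectors, threshold):
    """Average-linkage agglomerative clustering, cut at a distance threshold.

    The threshold-based stopping rule is an assumption of this sketch.
    """
    dist = {(i, j): cosine_distance(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)}

    def linkage(a, b):
        pairs = [dist[min(i, j), max(i, j)] for i in a for j in b]
        return sum(pairs) / len(pairs)

    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        if linkage(a, b) > threshold:
            break  # the closest remaining pair is too far apart
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return sorted(sorted(c) for c in clusters)
```

Calling `cluster(log_entropy_vectors(docs), threshold)` on a list of document strings returns lists of document indices, one list per proposed author cluster. In practice the cut point would need to be tuned or chosen by a cluster-validity criterion rather than fixed by hand.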
