Evaluating authorship distance methods using the positive Silhouette coefficient

Abstract

Unsupervised Authorship Analysis (UAA) aims to cluster documents by authorship without knowing the authorship of any document. An important factor in UAA is the method used to calculate the distance between documents. The choice of authorship distance method is considered more critical to the end result than the choice of cluster analysis algorithm. One method for measuring the correlation between a distance metric and a labelling (such as class values or clusters) is the Silhouette Coefficient (SC). The SC can therefore be used to evaluate the quality of an authorship distance method by measuring the correlation between that method and the true authorship. However, we show that the SC can be severely affected by outliers. To address this issue, we introduce the Positive Silhouette Coefficient (PSC), defined as the proportion of instances with a positive SC value. The PSC is not easily altered by outliers and is therefore a more robust measure. A large number of authorship distance methods are then compared using the PSC, and the findings are presented. This research provides insight into the efficacy of methods for UAA and presents a framework for testing authorship distance methods.
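The definition above, the proportion of instances with a positive silhouette value, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy one-dimensional "documents", and the use of absolute difference as a stand-in for an authorship distance method are all assumptions made for illustration. Singleton clusters are given a silhouette of 0, following the common convention.

```python
from collections import defaultdict


def silhouette_values(dist, labels):
    """Per-instance silhouette values from a full pairwise distance matrix.

    dist[i][j] is the distance between instances i and j;
    labels[i] is the (true) author of instance i.
    """
    members = defaultdict(list)
    for i, lab in enumerate(labels):
        members[lab].append(i)
    if len(members) < 2:
        raise ValueError("need at least two distinct labels")

    values = []
    for i, lab in enumerate(labels):
        own = members[lab]
        if len(own) == 1:
            # Assumed convention: a singleton cluster scores 0.
            values.append(0.0)
            continue
        # a: mean distance to the other instances with the same label.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the instances of any other label.
        b = min(
            sum(dist[i][j] for j in idx) / len(idx)
            for other, idx in members.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def mean_silhouette(dist, labels):
    """Standard SC: the mean of the per-instance silhouette values."""
    vals = silhouette_values(dist, labels)
    return sum(vals) / len(vals)


def positive_silhouette(dist, labels):
    """PSC: the proportion of instances with a strictly positive silhouette."""
    vals = silhouette_values(dist, labels)
    return sum(1 for v in vals if v > 0) / len(vals)


# Hypothetical toy data: each "document" is a single number, and the
# distance is the absolute difference. The fourth document by author "A"
# (value 9.0) is an outlier that sits close to author "B".
points = [0.0, 1.0, 2.0, 9.0, 10.0, 11.0, 12.0]
labels = ["A", "A", "A", "A", "B", "B", "B"]
dist = [[abs(p - q) for q in points] for p in points]

values = silhouette_values(dist, labels)
sc = mean_silhouette(dist, labels)
psc = positive_silhouette(dist, labels)
print("per-instance silhouettes:", [round(v, 3) for v in values])
print("mean SC:", round(sc, 3))
print("PSC:", round(psc, 3))
```

In this toy case the single outlier receives a negative silhouette and also inflates the within-author distances of its neighbours, which lowers the mean SC, whereas the PSC only registers that one instance out of seven is not positive.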
