Authorship Verification with Compression Features

In the PAN 2013 Author Identification task, the problem was to verify whether a document was written by the same author as a small set of given reference documents. We approached the problem as a classification task in which the reference documents form the target class. We additionally collected a set of documents for the non-target, or outlier, class. For this classification problem we prepared three submissions for the English Authorship Verification task. The first submission applies the nearest neighbor rule using compression distances from the "questioned" document to the reference and outlier documents. The second and third submissions represent each document by its compression distances to random prototype documents. In the resulting prototype space the Lowest Error in Sparse Subspace (LESS) classifier is applied. The third submission additionally uses document resampling, or bootstrapping, to mitigate the small sample problem when the number of reference documents is low. In the evaluation, our submission achieved the best performance of 16 teams, with precision = 0.80, recall = 0.80 and F1 = 0.80.
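
The nearest neighbor rule of the first submission and the prototype representation of the second and third can be illustrated with a minimal sketch. The abstract does not state which compressor or distance formula was used, so the sketch assumes zlib and the normalized compression distance (NCD) as stand-ins. The function names `compressed_size`, `ncd`, `verify_nearest_neighbor` and `prototype_vector` are hypothetical and chosen only for illustration. The LESS classifier and the bootstrapping step are not shown.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed text (assumed compressor)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two documents.

    Assumption: NCD is used here as an example of a compression distance;
    the abstract does not name the exact measure.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def verify_nearest_neighbor(questioned: str,
                            references: list[str],
                            outliers: list[str]) -> bool:
    """Nearest neighbor rule: accept the questioned document if its closest
    document belongs to the reference (target) class rather than the
    outlier class."""
    d_ref = min(ncd(questioned, r) for r in references)
    d_out = min(ncd(questioned, o) for o in outliers)
    return d_ref <= d_out


def prototype_vector(doc: str, prototypes: list[str]) -> list[float]:
    """Represent a document by its distances to a list of prototype
    documents; a classifier can then be trained on these vectors."""
    return [ncd(doc, p) for p in prototypes]
```

Called as `verify_nearest_neighbor(questioned, references, outliers)`, the function returns `True` when the closest document is a reference document. A general-purpose compressor such as zlib has a limited window, so for long documents a different compressor would likely be needed for the distances to be meaningful.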
