Proximity Based One-class Classification with Common N-Gram Dissimilarity for Authorship Verification Task
Notebook for PAN at CLEF 2013

We describe our participation in the Author Identification task of the PAN 2013 competition. This task presents participants with a set of authorship verification problems. In each problem, one is given a set of documents written by one author and a sample document; the task is to decide whether or not the sample document was written by the same author as the remaining documents. We approach this problem by proposing a proximity based method for one-class classification (based on an idea similar to the k-center boundary method) that applies the Common N-Gram (CNG) dissimilarity measure. The CNG dissimilarity is based on the differences in the frequencies of the character n-grams that are most common in the considered documents. Our method compares the dissimilarity between the sample document and each document from the target set of documents of known authorship to the maximum dissimilarity between this target document and all other documents from the set; thresholding is then applied to arrive at the classification of the sample document. Our method yielded an F1 of 0.659 on the whole competition test dataset and a competition ranking of 5th (shared) out of 18 (according to the results announced on June 12, 2013).
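The comparison described above can be illustrated with a minimal Python sketch. It is not the implementation used in the competition. It assumes the commonly cited form of the CNG dissimilarity (a sum of squared relative differences of normalized n-gram frequencies over the union of two profiles). The function names, the n-gram length `n`, the profile size `size`, the `threshold` value, and the choice to average the per-document ratios before thresholding are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def profile(text, n=3, size=500):
    """Normalized frequencies of the `size` most common character n-grams.

    The defaults for `n` and `size` are illustrative assumptions.
    """
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(size)}


def cng_dissimilarity(p1, p2):
    """Sum of squared relative frequency differences over the union of two profiles."""
    total = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        # f1 + f2 > 0 because g occurs in at least one of the profiles.
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


def same_author(known_docs, sample, threshold=1.0, n=3, size=500):
    """Hypothetical decision rule: True if the sample is attributed to the known author.

    For each known document, the dissimilarity to the sample is divided by
    the maximum dissimilarity between that document and the other known
    documents. Averaging the ratios is an assumption of this sketch.
    """
    if len(known_docs) < 2:
        raise ValueError("this sketch needs at least two known documents")
    known = [profile(d, n, size) for d in known_docs]
    unknown = profile(sample, n, size)
    ratios = []
    for i, p in enumerate(known):
        max_within = max(
            cng_dissimilarity(p, q) for j, q in enumerate(known) if j != i
        )
        to_sample = cng_dissimilarity(unknown, p)
        if max_within == 0:
            # Degenerate case: all other known documents share this profile.
            ratios.append(0.0 if to_sample == 0 else float("inf"))
        else:
            ratios.append(to_sample / max_within)
    return mean(ratios) <= threshold
```

A ratio below 1 for a given known document means the sample lies closer to that document than its most distant companion in the known set does. The sketch requires at least two known documents; a problem with a single known document would need separate handling, which is not covered here.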