We describe our participation in the Author Identification task of the PAN 2013 competition. The task presents participants with a set of authorship verification problems. In each problem, one is given a set of documents written by a single author and a sample document; the task is to decide whether the sample document was written by the same author as the other documents. We approach this problem with a proximity-based method for one-class classification (based on an idea similar to the k-center boundary method) that applies the Common N-Gram (CNG) dissimilarity measure. The CNG dissimilarity is based on the differences in the frequencies of the character n-grams that are most common in the considered documents. Our method compares the dissimilarity between the sample document and each document from the target set of documents of known authorship to the maximum dissimilarity between that target document and all other documents in the set; thresholding is then applied to arrive at the classification of the sample document. Our method yielded an F1 of 0.659 on the whole competition test dataset and a competition ranking of 5th (shared) out of 18 (according to the results announced on June 12, 2013).
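The comparison described above can be illustrated with a minimal Python sketch. It is not the submitted system. The dissimilarity formula is the commonly cited CNG form (squared differences of n-gram frequencies, each normalized by the mean of the two frequencies). The names `profile`, `cng_dissimilarity`, and `verify`, the defaults `n=3` and `size=500`, the averaging of the per-document ratios, and the threshold of 1.0 are all assumptions made for illustration; the abstract states only that thresholding is applied.

```python
from collections import Counter


def profile(text, n=3, size=500):
    """Normalized frequencies of the `size` most common character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(size)}


def cng_dissimilarity(p1, p2):
    """CNG-style dissimilarity between two profiles (assumed formula)."""
    total = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        # f1 + f2 > 0 because g occurs in at least one of the profiles.
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total


def verify(known_docs, sample, n=3, size=500, threshold=1.0):
    """Return True if `sample` is accepted as written by the known author.

    For each known document, the dissimilarity to the sample is divided by
    the maximum dissimilarity between that document and the other known
    documents. Averaging the ratios and the threshold value are assumptions.
    """
    if len(known_docs) < 2:
        raise ValueError("at least two known documents are required")
    known = [profile(d, n, size) for d in known_docs]
    s = profile(sample, n, size)
    ratios = []
    for i, p in enumerate(known):
        max_within = max(
            cng_dissimilarity(p, q) for j, q in enumerate(known) if j != i
        )
        if max_within == 0:
            raise ValueError("known documents have identical profiles")
        ratios.append(cng_dissimilarity(s, p) / max_within)
    return sum(ratios) / len(ratios) <= threshold


known = [
    "the cat sat on the mat " * 5,
    "the cat sat on the hat " * 5,
    "the cat sat on the bat " * 5,
]
print(verify(known, "the cat sat on the mat " * 5))
print(verify(known, "zzzz qqqq xxxx " * 5))
```

In this sketch a ratio below 1 means the sample is closer to a known document than that document's most distant known sibling, which is the intuition the abstract shares with the k-center boundary method. The sketch needs at least two known documents, since the within-set maximum is undefined for a single one.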