Paper Information - Kernel Methods and String Kernels for Authorship Analysis

Kernel Methods and String Kernels for Authorship Analysis

This paper presents our approach to the PAN 2012 Traditional Authorship Attribution tasks and the Sexual Predator Identification task. We approached these tasks with machine learning methods that work at the character level. More precisely, we treated texts simply as sequences of symbols (strings) and used string kernels in conjunction with different kernel-based learning methods, both supervised and unsupervised. The results were extremely good: we ranked first in most problems and first overall in the traditional authorship attribution task, according to the evaluation provided by the organizers.
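To make the character-level idea concrete, here is a minimal sketch of one common string kernel, the p-spectrum kernel, which compares two texts by the character n-grams they share. The abstract does not say which string kernel or parameters the authors used, so the choice of kernel, the function names, and the default `n=3` are all assumptions made for illustration only.

```python
from collections import Counter
import math


def char_ngrams(text: str, n: int) -> Counter:
    """Count every contiguous character n-gram in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(a: str, b: str, n: int = 3) -> int:
    """p-spectrum kernel: inner product of the two n-gram count vectors.

    Illustrative choice only; not necessarily the kernel used in the paper.
    """
    ca, cb = char_ngrams(a, n), char_ngrams(b, n)
    if len(ca) > len(cb):
        ca, cb = cb, ca  # iterate over the smaller table
    # Counter returns 0 for n-grams that are absent from the other text.
    return sum(count * cb[gram] for gram, count in ca.items())


def normalized_spectrum_kernel(a: str, b: str, n: int = 3) -> float:
    """Cosine-style normalization so that text length matters less."""
    k_ab = spectrum_kernel(a, b, n)
    if k_ab == 0:
        return 0.0
    return k_ab / math.sqrt(spectrum_kernel(a, a, n) * spectrum_kernel(b, b, n))
```

Evaluating such a function on every pair of documents yields a kernel (Gram) matrix, which is the kind of input that kernel-based learners, supervised or unsupervised, generally accept.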

Cristian Grozea | Marius Popescu