Paper information - Projecting Away the Class Imbalance Problem in Author Attribution

Author identification algorithms attempt to ascribe a document to its author, with an eye toward diverse application areas including forensic evidence, authentication of communications, and intelligence gathering. We view author identification as a single-label classification problem, in which 2000 authors imply 2000 possible categories to assign to a post. Experiments with a naive Bayes classifier on a blog author identification task demonstrate a remarkable tendency to over-predict the most prolific authors. A literature search confirms that class imbalance is a challenge for author identification as well as for other machine learning tasks. We develop a vector projection method to remove this hazard and achieve a 63% improvement in accuracy over the baseline on the same task. Our method adds no asymptotic computational complexity to naive Bayes and has no free parameters to set. The projection technique will likely prove useful for other natural language tasks exhibiting class imbalance.
