Paper Information - Using the Complexity of the Distribution of Lexical Elements as a Feature in Authorship Attribution

Traditional authorship attribution models extract normalized counts of lexical elements such as nouns, common words, and punctuation, and use these normalized counts or ratios as features for author fingerprinting. The text is treated as a bag of words, and the order of words and their positions relative to one another are largely ignored. We propose a new feature extraction method that quantifies the distribution of lexical elements within the text using Kolmogorov complexity estimates. Tests carried out on blog corpora indicate that such measures outperform ratios when used as features in an SVM authorship attribution model. Moreover, adding complexity estimates to a model that uses ratios increased the F-measure by 5.2-11.8%.
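
The difference between a ratio feature and a complexity feature can be illustrated with a minimal sketch. The abstract does not specify how the text is encoded or which estimator is used, so everything below is an assumption made for illustration: the text is mapped to a binary string marking where a lexical element occurs, and the compressed size under `zlib` serves as a rough stand-in for a Kolmogorov complexity estimate. The `FUNCTION_WORDS` list and all function names are hypothetical.

```python
import random
import re
import zlib

# Hypothetical, deliberately tiny list of "common words" (assumption, not from the paper).
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}


def tokenize(text):
    """Split lowercased text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def indicator_string(tokens, predicate):
    """Encode the token sequence as '1' where the lexical element occurs, else '0'."""
    return "".join("1" if predicate(tok) else "0" for tok in tokens)


def ratio_feature(bits):
    """Traditional feature: normalized count, which ignores token positions."""
    return bits.count("1") / len(bits) if bits else 0.0


def complexity_feature(bits):
    """Compression-based proxy for Kolmogorov complexity (assumed estimator).

    Returns compressed size divided by raw size, so strings whose element
    positions follow a regular pattern score lower than irregular ones.
    """
    if not bits:
        return 0.0
    raw = bits.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


if __name__ == "__main__":
    # Two indicator strings with the same ratio but different distributions.
    regular = "10" * 500
    chars = list(regular)
    random.Random(0).shuffle(chars)
    irregular = "".join(chars)

    for name, bits in (("regular", regular), ("irregular", irregular)):
        print(name, ratio_feature(bits), round(complexity_feature(bits), 3))

    # Indicator string for common words in a short sample sentence.
    tokens = tokenize("The order of words is largely ignored in a bag of words.")
    print(indicator_string(tokens, lambda tok: tok in FUNCTION_WORDS))
```

Both synthetic strings have the same ratio, so a ratio feature cannot tell them apart, while the compression-based score separates the regular placement from the irregular one. On very short texts the fixed overhead of the compressor dominates, so a practical system would need longer samples or a different estimator.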
