N-Gram Feature Selection for Authorship Identification

Automatic authorship identification is a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful way to represent text for stylistic purposes, since they capture nuances at the lexical, syntactic, and structural levels. So far, only character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work on selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few members in common. Moreover, we explore the significance of digits for distinguishing between authors, showing that simple text pre-processing can increase performance.
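
As a rough illustration of the representation discussed above, the following Python sketch counts character n-grams over a range of lengths after masking digits. The function names, the length range 3 to 5, and the choice of replacing each digit with a single placeholder symbol are assumptions made for this example; the abstract does not specify the pre-processing step or the selection procedure, and this sketch performs no feature selection.

```python
import re
from collections import Counter


def mask_digits(text, placeholder="@"):
    """Replace every digit with one placeholder symbol.

    Assumed pre-processing for illustration only; the abstract
    says only that "simple text pre-processing" is used.
    """
    return re.sub(r"\d", placeholder, text)


def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Hypothetical sample sentence, not taken from the corpus.
sample = "Shares rose 12.5% in 1997."
profile = char_ngrams(mask_digits(sample))
```

Masking digits maps different numbers onto shared n-grams (for example, every four-digit year becomes the same pattern), so that what is counted is how an author formats numbers rather than which numbers appear.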
