A framework for authorship identification of online messages: Writing-style features and classification techniques

With the rapid proliferation of Internet technologies and applications, misuse of online messages for inappropriate or illegal purposes has become a major concern for society. The anonymous nature of online-message distribution makes identity tracing a critical problem. We developed a framework for authorship identification of online messages to address the identity-tracing problem. In this framework, four types of writing-style features (lexical, syntactic, structural, and content-specific features) are extracted and inductive learning algorithms are used to build feature-based classification models to identify authorship of online messages. To examine this framework, we conducted experiments on English and Chinese online-newsgroup messages. We compared the discriminating power of the four types of features and of three classification techniques: decision trees, back-propagation neural networks, and support vector machines. The experimental results showed that the proposed approach was able to identify authors of online messages with satisfactory accuracy of 70 to 95%. All four types of message features contributed to discriminating authors of online messages. Support vector machines outperformed the other two classification techniques in our experiments. The high performance we achieved for both the English and Chinese datasets showed the potential of this approach in a multiple-language context.

Introduction

The rapid development and proliferation of Internet technologies and applications have created a new way to share information across time and space. A wide range of activities have evolved over the Internet, ranging from simple information exchange and resource sharing to virtual communications and e-commerce activities. In particular, online messages are being extensively used to distribute information over Web-based channels such as e-mail, Web sites, Internet newsgroups, and Internet chat rooms.
Unfortunately, online messages can also be misused to distribute unsolicited or inappropriate information such as junk mail (commonly referred to as "spamming") and offensive or threatening messages. Moreover, criminals have been using online messages to distribute illegal materials, including pirated software, child pornography, stolen property, and so on. In addition, criminal or terrorist organizations use online messages as one of their major communication channels. These activities have spawned the concept of "cybercrime." Cybercrime was defined by Thomas and Loader (2000) as illegal computer-mediated activities that can be conducted through global electronic networks. A common characteristic of online messages is anonymity. People usually do not need to provide real identity information such as name, age, gender, and address. In many misuse or crime cases of online messages, the sender will …
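
As a rough illustration of the feature-extraction step named in the abstract, the sketch below computes a handful of lexical, syntactic, structural, and content-specific measurements for a single message. It is a minimal sketch, not the authors' feature set: the function name `style_features`, the word lists `FUNCTION_WORDS` and `CONTENT_KEYWORDS`, and the particular measurements are all assumptions chosen for illustration, and the classification step (decision tree, neural network, or support vector machine) is left out.

```python
import re
import string

# Hypothetical, illustrative lists; the paper's actual lists are not given here.
FUNCTION_WORDS = ("the", "of", "and", "to", "in")
CONTENT_KEYWORDS = ("sale", "software", "price")


def style_features(message: str) -> dict:
    """Return a small writing-style feature vector for one message."""
    words = re.findall(r"[A-Za-z']+", message.lower())
    n_words = max(len(words), 1)    # guard against division by zero
    n_chars = max(len(message), 1)
    lines = message.splitlines()
    paragraphs = [p for p in re.split(r"\n\s*\n", message) if p.strip()]

    features = {
        # Lexical: character- and word-level statistics
        "lex_avg_word_len": sum(len(w) for w in words) / n_words,
        "lex_vocab_richness": len(set(words)) / n_words,
        "lex_digit_ratio": sum(c.isdigit() for c in message) / n_chars,
        # Syntactic: punctuation usage
        "syn_punct_ratio": sum(c in string.punctuation for c in message) / n_chars,
        # Structural: layout of the message
        "str_num_lines": len(lines),
        "str_num_paragraphs": len(paragraphs),
    }
    # Syntactic: relative frequency of each function word
    for fw in FUNCTION_WORDS:
        features[f"syn_fw_{fw}"] = words.count(fw) / n_words
    # Content-specific: relative frequency of each topic keyword
    for kw in CONTENT_KEYWORDS:
        features[f"con_kw_{kw}"] = words.count(kw) / n_words
    return features
```

In a setup like the one the abstract describes, one such vector per message, labeled with its author, would form the training data for whichever classifier is being compared.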

[1]  H. T. Eddy The characteristic curves of composition. , 1887, Science.

[2]  G. Yule On Sentence-Length as a Statistical Characteristic of Style in Prose: With Application to Two Cases of Disputed Authorship , 1939 .

[3]  M. Kendall The Statistical Study of Literary Vocabulary , 1944, Nature.

[4]  F. Mosteller,et al.  Inference and Disputed Authorship: The Federalist , 1966 .

[5]  Frederick Mosteller,et al.  Applied Bayesian and classical inference : the case of the Federalist papers , 1984 .

[6]  Roger Mitton,et al.  Spelling checkers, spelling correctors and the misspellings of poor spellers , 1987, Inf. Process. Manag..

[7]  John Burrows,et al.  Word-Patterns and Story-Shapes: The Statistical Analysis of Narrative Style , 1987 .

[8]  John F. Burrows,et al.  ‘An ocean where each kind...’: Statistical analysis and some major determinants of literary style , 1989, Comput. Humanit..

[9]  W. Shakespeare,et al.  Shakespeare, Fletcher and "The Two Noble Kinsmen" , 1990 .

[10]  Thomas G. Dietterich,et al.  A Comparative Study of ID3 and Backpropagation for English Text-to-Speech Mapping , 1990, ML.

[11]  D. Holmes A Stylometric Analysis of Mormon Scripture and Related Texts , 1992 .

[12]  Thomas Merriam,et al.  Shakespeare, Fletcher, and the Two Noble Kinsmen , 1994 .

[13]  Bernard Widrow,et al.  Neural networks: applications in industry, business and science , 1994, CACM.

[14]  Simon Kasif,et al.  A System for Induction of Oblique Decision Trees , 1994, J. Artif. Intell. Res..

[15]  Bradley Kjell,et al.  Authorship Determination Using Letter Pair Frequency Features with Neural Network Classifiers , 1995 .

[16]  D. Lowe,et al.  Shakespeare vs. fletcher: A stylometric analysis by radial basis functions , 1995, Comput. Humanit..

[17]  Colin Martindale,et al.  On the utility of content analysis in author attribution:The Federalist , 1995, Comput. Humanit..

[18]  D. L. Mealand Correspondence Analysis of Luke , 1995 .

[19]  Douglas Biber,et al.  Dimensions of Register Variation: A Cross-Linguistic Comparison , 1995 .

[20]  D. Holmes,et al.  The Federalist Revisited: New Directions in Authorship Attribution , 1995 .

[21]  A. Q. Morton,et al.  Analysing for authorship : a guide to the cusum technique , 1996 .

[22]  David I. Holmes,et al.  Feature-Finding for Text Classification , 1996 .

[23]  H. van Halteren,et al.  Outside the cave of shadows: using syntactic annotation to enhance authorship attribution , 1996 .

[24]  David I. Holmes,et al.  Neural network applications in stylometry: The Federalist Papers , 1996, Comput. Humanit..

[25]  Joseph Rudman,et al.  The State of Authorship Attribution Studies: Some Problems and Solutions , 1997, Comput. Humanit..

[26]  Federico Girosi,et al.  Training support vector machines: an application to face detection , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[27]  Hsinchun Chen,et al.  A Machine Learning Approach to Inductive Query by Examples: An Experiment Using Relevance Feedback, ID3, Genetic Algorithms, and Simulated Annealing , 1998 .

[28]  R. Harald Baayen,et al.  How Variable May a Constant be? Measures of Lexical Richness in Perspective , 1998, Comput. Humanit..

[29]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[30]  Jacek M. Zurada,et al.  Neural Networks And Hybrid Intelligent Models: Foundations, Theory, And Applications , 1998, IEEE Trans. Neural Networks.

[31]  D. Holmes The Evolution of Stylometry in Humanities Scholarship , 1998 .

[32]  Hsinchun Chen,et al.  A Machine Learning Approach to Inductive Query by Examples: An Experiment Using Relevance Feedback, ID3, Genetic Algorithms, and Simulated Annealing , 1998, J. Am. Soc. Inf. Sci..

[33]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization , 1999, Advances in Kernel Methods.

[34]  Johan F. Hoorn,et al.  Neural network identification of poets using letter sequences , 1999 .

[35]  Jose Nilo G. Binongo,et al.  The application of principal component analysis to stylometry , 1999 .

[36]  Hugh Craig Authorial attribution and computational stylistics: if you can tell authors apart, have you learned anything about them? , 1999 .

[37]  Olivier de Vel,et al.  Mining E-mail Authorship , 2000 .

[38]  B. Loader,et al.  Cybercrime : law enforcement, security and surveillance in the information age , 2000 .

[39]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[40]  Paul Clough,et al.  Plagiarism in natural and programming languages: an overview of current tools and technologies , 2000 .

[41]  Hsinchun Chen,et al.  Estimating drug/plasma concentration levels by applying neural networks to pharmacokinetic data sets , 2000, Decis. Support Syst..

[42]  Frank L. Lewis,et al.  Optimal design of CMAC neural-network controller for robot manipulators , 2000, IEEE Trans. Syst. Man Cybern. Part C.

[43]  George M. Mohay,et al.  Mining e-mail content for author identification forensics , 2001, SGMD.

[44]  S. Sathiya Keerthi,et al.  Improvements to Platt's SMO Algorithm for SVM Classifier Design , 2001, Neural Computation.

[45]  Efstathios Stamatatos,et al.  Computer-Based Authorship Attribution Without Lexical Measures , 2001, Comput. Humanit..

[46]  George M. Mohay,et al.  Gender-preferential text mining of e-mail discourse , 2002, 18th Annual Computer Security Applications Conference, 2002. Proceedings..

[47]  Fiona J. Tweedie Using Markov Chains for Identification of Writers , 2002 .

[48]  Thorsten Joachims,et al.  Learning to classify text using support vector machines - methods, theory and algorithms , 2002, The Kluwer international series in engineering and computer science.

[49]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[50]  R. H. Baayen,et al.  An experiment in authorship attribution , 2002 .

[51]  Shlomo Argamon,et al.  Automatically Categorizing Written Texts by Author Gender , 2002, Lit. Linguistic Comput..

[52]  Stephen G. MacDonell,et al.  Software Forensics: Extending Authorship Analysis Techniques to Computer Programs , 2002 .

[53]  Dustin Boswell,et al.  Introduction to Support Vector Machines , 2002 .

[54]  Shlomo Argamon,et al.  Style mining of electronic messages for multiple authorship discrimination: first results , 2003, KDD '03.

[55]  Rong Zheng,et al.  Authorship Analysis in Cybercrime Investigation , 2003, ISI.

[56]  Jörg Kindermann,et al.  Authorship Attribution with Support Vector Machines , 2003, Applied Intelligence.

[57]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.