A Profile-Based Authorship Attribution Approach to Forensic Identification in Chinese Online Messages

With the popularity of Internet technologies and applications, inappropriate or illegal online messages have become a problem for society. The goal of authorship attribution for anonymous online messages is to identify the author from among a group of potential suspects in support of an investigation. Most previous work has focused on extracting various writing-style features and employing machine learning algorithms to identify the author. However, Chinese online messages contain not only Chinese characters but also English characters, special symbols, emoticons, slang, and so on, which makes it challenging for word segmentation techniques to segment them correctly. Moreover, online messages are usually short, and the performance of traditional machine learning algorithms degrades considerably on short samples. This paper proposes a profile-based authorship attribution approach for Chinese online messages. N-gram techniques are employed to extract frequent sequences, and the category frequency feature selection method is used to filter out frequent sequences that are common across categories. The profile-based method then represents each suspect as a category profile. An unknown illegal message is attributed to the most likely author by comparing its similarity with the suspects' profiles. Experiments on BBS, Blog, and E-mail datasets show that the proposed profile-based authorship attribution approach can identify authors effectively. Compared with two instance-based benchmark methods, the proposed profile-based method obtains better authorship attribution results.
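The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the n-gram length, the category frequency threshold, or the similarity measure, so the character bigrams, the `max_cf` cutoff, and the cosine similarity used here are all assumptions, as are the function names and the toy English messages.

```python
from collections import Counter
import math


def char_ngrams(text, n=2):
    """Count overlapping character n-grams (no word segmentation needed)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def build_profiles(training, n=2, max_cf=1):
    """Build one n-gram profile per author from all of that author's messages.

    training: dict mapping author -> list of messages.
    max_cf (assumed threshold): keep an n-gram only if it occurs in the
    profiles of at most max_cf authors, which drops sequences common to many.
    """
    raw = {}
    for author, messages in training.items():
        profile = Counter()
        for message in messages:
            profile.update(char_ngrams(message, n))
        raw[author] = profile

    # Category frequency: number of author profiles containing each n-gram.
    cf = Counter()
    for profile in raw.values():
        cf.update(profile.keys())

    return {
        author: Counter({g: c for g, c in profile.items() if cf[g] <= max_cf})
        for author, profile in raw.items()
    }


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (assumed measure)."""
    dot = sum(count * b[g] for g, count in a.items() if g in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(message, profiles, n=2):
    """Return the author whose profile is most similar to the message."""
    query = char_ngrams(message, n)
    return max(profiles, key=lambda author: cosine(query, profiles[author]))


# Toy data, invented for illustration only.
training = {
    "alice": ["ok lol :)", "lol thanks :)"],
    "bob": ["thanks, regards.", "kind regards."],
}
profiles = build_profiles(training)
print(attribute("lol ok :)", profiles))
```

Because each suspect is represented by a single profile pooled over all of their messages, short individual messages still contribute to a reasonably dense representation, which is the motivation the abstract gives for preferring a profile-based method over instance-based ones.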
