Mining writeprints from anonymous e-mails for forensic investigation

Many criminals exploit the convenience of anonymity in the cyber world to conduct illegal activities. E-mail is the most commonly used medium for such activities. Extracting knowledge and information from e-mail text has become an important step in cybercrime investigation and evidence collection. Yet it is one of the most challenging and time-consuming tasks because of the special characteristics of e-mail datasets. In this paper, we focus on the problem of mining writing styles from a collection of e-mails written by multiple anonymous authors. The general idea is to first cluster the anonymous e-mails by their stylometric features and then extract the writeprint, i.e., the unique writing style, from each cluster. We emphasize that the presented problem, together with our proposed solution, is different from the traditional problem of authorship identification, which assumes that training data is available for building a classifier. Our proposed method is particularly useful in the initial stage of an investigation, in which the investigator usually has very little information about the case or the true authors of the suspicious e-mail collection. Experiments on a real-life dataset suggest that clustering by writing style is a promising approach for grouping e-mails written by the same author.
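
The two-step idea described above (cluster by stylometric features, then summarize each cluster) can be illustrated with a minimal Python sketch. It is not the method of the paper: the four features, the use of k-means with farthest-point initialization, and the representation of a "writeprint" as the per-cluster mean feature vector are all assumptions chosen for brevity, and the function names and sample e-mails are hypothetical.

```python
import math
import re


def stylometric_features(text):
    """Map one e-mail body to a small numeric vector (illustrative features only)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    return [
        sum(len(w) for w in words) / n_words,        # average word length
        len(words) / max(len(sentences), 1),         # average sentence length in words
        sum(c in ",;:!?" for c in text) / n_chars,   # share of selected punctuation marks
        sum(c.isupper() for c in text) / n_chars,    # share of uppercase characters
    ]


def normalize(vectors):
    """Min-max scale each feature to [0, 1] so no single feature dominates distances."""
    cols = list(zip(*vectors))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(vec, lo, hi)]
        for vec in vectors
    ]


def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialization (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def mine_writeprints(emails, k):
    """Cluster e-mails by style, then summarize each cluster as a mean feature vector."""
    raw = [stylometric_features(e) for e in emails]
    labels = kmeans(normalize(raw), k)
    writeprints = {}
    for j in set(labels):
        members = [r for r, lab in zip(raw, labels) if lab == j]
        writeprints[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, writeprints


# Hypothetical sample: two terse messages and two formal ones.
emails = [
    "hey. send it now. no delay. ok.",
    "yo. got it. will pay. bye.",
    "Dear Sir, I would appreciate receiving the documentation concerning "
    "the outstanding international transaction; kindly respond immediately.",
    "Dear Madam, Please acknowledge the considerable difficulties surrounding "
    "the aforementioned arrangement, particularly regarding confidentiality.",
]
labels, writeprints = mine_writeprints(emails, k=2)
for cluster_id, profile in sorted(writeprints.items()):
    print(cluster_id, [round(x, 3) for x in profile])
```

Note that no author labels are used anywhere in the sketch, which mirrors the distinction drawn above from classifier-based authorship identification. In practice the number of clusters is rarely known in advance and would itself have to be estimated.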
