A likelihood ratio-based evaluation of strength of authorship attribution evidence in SMS messages using N-grams

This paper describes an experiment in forensic text comparison (FTC) within the likelihood ratio (LR) framework. The experiment attempts to determine the strength of authorship attribution evidence modelled with N-grams, one of the most basic automatic modelling techniques. The SMS messages of multiple authors, selected from the SMS corpus compiled by the National University of Singapore, were used for same-author and different-author comparisons. I varied the number of words used for the N-gram modelling (200, 1000, 2000 or 3000 words) and assessed the performance of each setting with the log likelihood ratio cost (Cllr). The study shows that N-grams can be employed within an LR framework to discriminate between same-author and different-author SMS texts, but that a fairly large amount of data is needed to do so well (i.e. to obtain Cllr < 0.75). It is concluded that the LR framework warrants further examination with different features and processing techniques.
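As a rough illustration of the performance metric, the sketch below computes Cllr from two lists of likelihood ratios using the standard definition: the mean of log2(1 + 1/LR) over same-author comparisons and the mean of log2(1 + LR) over different-author comparisons, averaged together. The function name `cllr` and the example LR values are assumptions made for illustration; they are not taken from the study, and the N-gram scoring that would produce the LRs is not shown.

```python
import math


def cllr(same_author_lrs, different_author_lrs):
    """Log likelihood ratio cost for a set of validation comparisons.

    Both arguments are sequences of likelihood ratios (not log LRs).
    Hypothetical helper for illustration; not the study's own code.
    """
    if not same_author_lrs or not different_author_lrs:
        raise ValueError("both comparison sets must be non-empty")

    # Same-author comparisons are penalised when the LR is small.
    same_cost = sum(math.log2(1 + 1 / lr) for lr in same_author_lrs)
    same_cost /= len(same_author_lrs)

    # Different-author comparisons are penalised when the LR is large.
    diff_cost = sum(math.log2(1 + lr) for lr in different_author_lrs)
    diff_cost /= len(different_author_lrs)

    return 0.5 * (same_cost + diff_cost)


# Made-up LRs: an uninformative system returns LR = 1 for every comparison.
uninformative = cllr([1.0, 1.0], [1.0, 1.0])

# Made-up LRs: a system whose LRs point the right way in every comparison.
informative = cllr([100.0, 20.0], [0.01, 0.05])
```

An uninformative system that always reports LR = 1 scores exactly 1 under this definition, so values below 1 indicate that the LRs carry some useful information, and lower values are better.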
