APPLYING TOPIC MODELLING ON FORENSIC DATA: A CASE STUDY

Most actionable evidence in a digital investigation is identified during the analysis phase. The objective of this phase (digital analysis) is to reduce the quantity and enhance the intelligibility of the data that a human analyst must review. Currently, this is done through expression-based searching, which assumes a good understanding of the evidence prior to the search. Latent evidence, which the analyst does not know to search for, will therefore not be found with such methods. This suggests a clear role for knowledge discovery and data mining (KDD) techniques in enhancing the digital analysis process. The research described in this article investigates the application of topic modelling as a KDD technique to forensic data and its ability to contribute to digital analysis. Topic models infer the underlying semantic context of a text collection and summarise it as topics described by words. The data used for this application was extracted from a real digital investigation case. This novel application highlights several challenges that forensic data poses to topic modelling algorithms, and we report on lessons learned from the
