Applying Topic Modeling to Forensic Data

Most actionable evidence is identified during the analysis phase of a digital forensic investigation. Currently, the analysis phase relies on expression-based searches, which assume that the investigator already has a good understanding of the evidence; latent evidence cannot be found with such methods. Knowledge discovery and data mining (KDD) techniques can significantly enhance the analysis process. One promising KDD technique is topic modeling, which infers the underlying semantic context of text and summarizes it as topics, each described by a set of words. This paper investigates the application of topic modeling to forensic data and its ability to contribute to the analysis phase. It also highlights the challenges that forensic data poses to topic modeling algorithms and reports on the lessons learned from a case study.
