Clustering digital forensic string search output

This research comparatively evaluates four competing algorithms for thematically clustering digital forensic text string search output. It does so in a more realistic context, with respect to data size and heterogeneity, than past research. We used physical-level text string search output consisting of over two million search hits found in nearly 50,000 allocated files and unallocated blocks. Holding the data set constant, we evaluated k-Means, Kohonen SOM, Latent Dirichlet Allocation (LDA) followed by k-Means, and LDA followed by SOM. This enables true cross-algorithm evaluation, whereas past studies evaluated single algorithms using unique, non-reproducible data sets. Our research shows that LDA + k-Means, used with a linear, centroid-based user navigation procedure, produces optimal results. The winning approach increased information retrieval effectiveness from the baseline random walk absolute precision rate of 0.04 to an average precision rate of 0.67. We also explored a variety of algorithms for user navigation of search hit results and found that the performance of k-Means clustering can be greatly improved with a non-linear, non-centroid-based cluster and document navigation procedure. This has potential implications for digital forensic tools and their use, particularly given the popularity and speed of k-Means clustering.
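To make the precision figures concrete, the following is a minimal Python sketch of one way a linear, centroid-based review order and its precision could be computed. It is not the study's implementation: the abstract does not specify the navigation procedure in detail, so the cluster visiting order (by label), the within-cluster ranking (nearest to the centroid first), the function names, and the toy topic vectors are all assumptions made for illustration. The vectors stand in for per-document LDA topic proportions, and the cluster labels stand in for a k-Means assignment.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def linear_centroid_order(clusters: Dict[int, List[Tuple[str, Vector]]]) -> List[str]:
    """Hypothetical navigation order: visit clusters by label, and within
    each cluster review documents from nearest to farthest from the centroid."""
    order: List[str] = []
    for label in sorted(clusters):
        members = clusters[label]
        c = centroid([vec for _, vec in members])
        ranked = sorted(members, key=lambda m: euclidean(m[1], c))
        order.extend(doc_id for doc_id, _ in ranked)
    return order


def precision_at(order: List[str], relevant: Set[str], n: int) -> float:
    """Fraction of the first n reviewed documents that are relevant."""
    reviewed = order[:n]
    if not reviewed:
        return 0.0
    return sum(1 for d in reviewed if d in relevant) / len(reviewed)


# Toy data (assumed): two clusters of documents with 2-topic proportions.
clusters = {
    0: [("a", [0.9, 0.1]), ("b", [0.8, 0.2]), ("c", [0.6, 0.4])],
    1: [("d", [0.1, 0.9]), ("e", [0.2, 0.8]), ("f", [0.4, 0.6])],
}
relevant = {"a", "b"}

order = linear_centroid_order(clusters)
print(order)
print(precision_at(order, relevant, 2))
print(precision_at(order, relevant, len(order)))
```

In this toy case the relevant documents sit near the centroid of the first cluster reviewed, so precision over the first two documents is higher than precision over the whole set. That is the kind of gain over an unordered review that the reported precision rates measure, although the real data set and procedure are far larger and more involved.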
