A Machine Learning-based Triage methodology for automated categorization of digital media

The global diffusion of smartphones and tablets, which now exceed the market share of traditional desktops and laptops, presents investigative opportunities but also poses serious challenges to law enforcement agencies and forensic professionals. Traditional Digital Forensics techniques may no longer be appropriate for the timely analysis of digital devices found at the crime scene. Yet for crimes such as murder, child abduction, missing persons and death threats, timely analysis may be crucial to speeding up the investigation. Motivated by this, the paper explores the field of Triage, a relatively new branch of Digital Forensics intended to provide investigators with actionable intelligence through digital media inspection, and describes a new interdisciplinary approach that merges Digital Forensics techniques and Machine Learning principles. The proposed Triage methodology aims to automate the categorization of digital media on the basis of plausible connections between the traces retrieved (i.e. digital evidence) and the crimes under investigation. As an application of the proposed method, two case studies, on copyright infringement and child pornography exchange, are presented to show that the idea is viable. Following Machine Learning terminology, the term "feature" is used in the paper to mean a quantitative measure of a "plausible digital evidence". In this regard, we (a) define a list of crime-related features, (b) identify and extract them from available devices and forensic copies, (c) populate an input matrix and (d) process it with different Machine Learning mining schemes to arrive at a device classification. We benchmark the most popular mining algorithms (i.e. Bayes Networks, Decision Trees, Locally Weighted Learning and Support Vector Machines) to find the ones that best fit the case in question. 
The results are encouraging: when triaging a dataset of 13 digital media with 45 copyright infringement-related features, more than 93% of the media are correctly classified using Bayes Networks or Support Vector Machines, while for child pornography exchange, a dataset of 23 cell phones with 23 crime-related features allows 100% of the phones to be classified correctly. In this regard, methods to reduce the number of linearly independent features are explored and the corresponding classification results are presented.
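
Steps (a) to (d) of the methodology can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the feature names, the device records and the class labels are invented, the counts are log-scaled as a convenience, and a small Gaussian naive Bayes classifier with leave-one-out validation stands in for the mining schemes benchmarked in the paper (it is not a Bayes Network, and the paper's own validation scheme is not stated here).

```python
import math

# Step (a): hypothetical crime-related features (names invented for illustration).
FEATURES = ["p2p_clients_installed", "audio_files", "video_files", "burning_software"]

# Step (b): hypothetical per-device records, as if extracted from forensic copies.
DEVICES = [
    ({"p2p_clients_installed": 2, "audio_files": 5400, "video_files": 310, "burning_software": 1}, "infringing"),
    ({"p2p_clients_installed": 1, "audio_files": 8200, "video_files": 120, "burning_software": 1}, "infringing"),
    ({"p2p_clients_installed": 3, "audio_files": 2900, "video_files": 640, "burning_software": 1}, "infringing"),
    ({"audio_files": 120, "video_files": 15}, "non_infringing"),
    ({"p2p_clients_installed": 0, "audio_files": 40, "video_files": 8, "burning_software": 0}, "non_infringing"),
    ({"audio_files": 300, "video_files": 22, "burning_software": 1}, "non_infringing"),
]


def build_matrix(devices, features):
    """Step (c): one row per device, one column per feature (missing -> 0, log-scaled)."""
    X = [[math.log1p(record.get(name, 0)) for name in features] for record, _ in devices]
    y = [label for _, label in devices]
    return X, y


def fit(X, y, smoothing=1e-3):
    """Step (d): estimate a per-class prior, mean and variance for every feature."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, target in zip(X, y) if target == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The smoothing term keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + smoothing
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(y)), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest Gaussian naive Bayes log-posterior."""
    def score(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
    return max(model, key=score)


def leave_one_out_accuracy(X, y):
    """Train on all devices but one, test on the held-out device, and average."""
    hits = 0
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)


X, y = build_matrix(DEVICES, FEATURES)
accuracy = leave_one_out_accuracy(X, y)
print(f"{len(X)} devices x {len(FEATURES)} features, leave-one-out accuracy: {accuracy:.2f}")
```

With datasets as small as those in the case studies (13 and 23 devices), a held-out estimate of this kind is sensitive to every single device, which is one reason to look at reducing the number of features relative to the number of samples.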
