Using NLP techniques for file fragment classification

Abstract The classification of file fragments is an important problem in digital forensics. The literature does not include comprehensive work on applying machine learning techniques to this problem. In this work, we explore the use of techniques from natural language processing to classify file fragments. We take a supervised learning approach, based on the use of support vector machines combined with the bag-of-words model, where text documents are represented as unordered bags of words. This technique has been repeatedly shown to be effective and robust in classifying text documents (e.g., in distinguishing positive movie reviews from negative ones). In our approach, we represent file fragments as “bags of bytes” with feature vectors consisting of unigram and bigram counts, as well as other statistical measurements (including entropy and others). We made use of the publicly available Garfinkel data corpus to generate file fragments for training and testing. We ran a series of experiments, and found that this approach is effective in this domain as well.

[1]  Salvatore J. Stolfo,et al.  Fileprint analysis for Malware Detection 1 , 2005 .

[2]  Simson L. Garfinkel,et al.  Carving contiguous and fragmented files with fast object validation , 2007, Digit. Investig..

[3]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[4]  Levent Özgür,et al.  Text Categorization with Class-Based and Corpus-Based Keyword Selection , 2005, ISCIS.

[5]  Julian Seward Space-time tradeoffs in the inverse B-W transform , 2001, Proceedings DCC 2001. Data Compression Conference.

[6]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[7]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[8]  Stefan Axelsson,et al.  The Normalised Compression Distance as a file fragment classifier , 2010, Digit. Investig..

[9]  Ke Wang,et al.  Fileprints: identifying file types by n-gram analysis , 2005, Proceedings from the Sixth Annual IEEE SMC Information Assurance Workshop.

[10]  Simson L. Garfinkel,et al.  Bringing science to digital forensics with standardized forensic corpora , 2009, Digit. Investig..

[11]  R. Fisher THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS , 1936 .

[12]  Ponnuthurai N. Suganthan,et al.  A Novel Support Vector Machine Approach to High Entropy Data Fragment Classification , 2010, SAISMC.

[13]  Yiming Yang,et al.  Statistical Learning for File-Type Identification , 2011, 2011 10th International Conference on Machine Learning and Applications and Workshops.

[14]  Cor J. Veenman Statistical Disk Cluster Classification for File Carving , 2007, Third International Symposium on Information Assurance and Security.

[15]  A. Kolmogorov Three approaches to the quantitative definition of information , 1968 .

[16]  Drue Coles,et al.  Predicting the types of file fragments , 2008, Digit. Investig..

[17]  Abraham Lempel,et al.  On the Complexity of Finite Sequences , 1976, IEEE Trans. Inf. Theory.

[18]  Sergey Bratus,et al.  Automated mapping of large binary objects using primitive fragment type classification , 2010, Digit. Investig..