Decision tree algorithms for image data type identification

Identifying the file type of a file fragment has been studied for a long time, but it remains a challenge. The literature shows that high-entropy fragments make the problem harder. In particular, many popular file types share the same compression algorithm, such as deflate, which makes their fragments hard to tell apart. Machine learning and empirical techniques have been applied to address this problem. Compression is used to reduce the size of large files, including image files. Much of the research on file type identification has targeted the JPEG format, for which the Rate of Change feature has been shown to work effectively. By contrast, little work has addressed PNG, although it is a popular and widely used image format. In this article, we propose a new approach that combines deflate-encoded data detection, entropy-based clustering, and decision tree techniques to identify PNG data fragments, which are deflate-encoded. Experiments showed high accuracy rates for the proposed method.
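
To make the notion of a high-entropy fragment concrete, the following is a minimal Python sketch of two per-fragment statistics of the kind mentioned above. It is not the method proposed in the article. The function names `byte_entropy` and `mean_rate_of_change` are hypothetical, and treating Rate of Change as the mean absolute difference between consecutive byte values is an assumption made for illustration.

```python
import math
import zlib
from collections import Counter


def byte_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte (0.0 to 8.0)."""
    if not fragment:
        return 0.0
    total = len(fragment)
    counts = Counter(fragment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mean_rate_of_change(fragment: bytes) -> float:
    """Mean absolute difference between consecutive byte values.

    Assumption: this is one simple reading of a Rate of Change feature,
    not necessarily the definition used in the cited literature.
    """
    if len(fragment) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(fragment, fragment[1:])]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Illustrative input: plain digits and spaces, then the same data
    # compressed with zlib, which wraps a deflate stream.
    plain = " ".join(str(i * i) for i in range(20000)).encode()
    compressed = zlib.compress(plain)

    for name, data in (("plain", plain), ("deflate", compressed)):
        fragment = data[:4096]  # hypothetical fragment size
        print(name, round(byte_entropy(fragment), 3),
              round(mean_rate_of_change(fragment), 3))
```

Deflate output tends to score close to 8 bits per byte regardless of the original file type, which is why entropy alone cannot separate PNG fragments from other deflate-encoded data and further features or classifiers are needed.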
