BYTE FREQUENCY ANALYSIS DESCRIPTOR WITH SPATIAL INFORMATION FOR FILE FRAGMENT CLASSIFICATION

Digital forensics is generally concerned with recovering and investigating data from digital devices such as PCs and mobile phones. Examining information and extracting evidence from such devices is not an easy task. In data recovery, for example, the success of recovering digital information depends heavily on how effectively a method can understand the content of a document. The better the system understands the content of documents, the more effective it will be in recovering the desired documents. One of the challenging issues in recovering documents is determining the type of file fragments when the document structure is incomplete. One possible solution is based on statistical analysis, such as byte frequency analysis, for feature description. Byte frequency analysis computes a global descriptor that gives the statistical distribution of byte values in a file fragment. However, this approach feeds a single global histogram to a machine learning classifier such as a support vector machine, and a global histogram is insensitive to small changes in the file fragment content. Moreover, it does not include any spatial information and is liable to false positives, especially for large datasets. Therefore, byte frequency analysis with a circular representation is proposed, in which file fragments are divided into several blocks using a fixed partitioning scheme. For each block, the lower-level byte frequency analysis descriptor is then used to represent the partition. Finally, all features are combined into one large input vector for a machine learning classifier. We performed experiments on 10 different file categories at three different resolutions, i.e. level0, level1, and level2, and on combinations of these resolutions. The results show that the proposed method slightly outperforms the single byte frequency analysis distribution.
Keywords: digital forensics, byte frequency analysis, support vector machine, spatial information circular scheme
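
The block-wise descriptor outlined in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the partitioning details or the circular scheme, so the sketch makes its own assumptions: each fragment is split into 2^level contiguous, equal-length blocks at each resolution level, and each block is described by a normalized 256-bin byte histogram. The function names `byte_histogram` and `spatial_bfa` are hypothetical and are not taken from the paper.

```python
import numpy as np


def byte_histogram(block: bytes) -> np.ndarray:
    """Normalized 256-bin byte frequency histogram of one block."""
    if not block:
        # An empty block carries no information; return an all-zero histogram.
        return np.zeros(256, dtype=float)
    values = np.frombuffer(block, dtype=np.uint8)
    counts = np.bincount(values, minlength=256)
    return counts / counts.sum()


def spatial_bfa(fragment: bytes, levels=(0, 1, 2)) -> np.ndarray:
    """Concatenate per-block histograms over several resolution levels.

    Assumption (not from the paper): level k splits the fragment into
    2**k contiguous blocks of (nearly) equal length, so level 0 is the
    plain global byte frequency histogram.
    """
    parts = []
    for level in levels:
        n_blocks = 2 ** level
        bounds = np.linspace(0, len(fragment), n_blocks + 1).astype(int)
        for start, end in zip(bounds[:-1], bounds[1:]):
            parts.append(byte_histogram(fragment[start:end]))
    return np.concatenate(parts)


if __name__ == "__main__":
    fragment = bytes(range(256)) * 2  # toy 512-byte fragment
    vector = spatial_bfa(fragment)
    print(vector.shape)
```

Under these assumptions, levels 0 to 2 together yield 1 + 2 + 4 = 7 histograms, i.e. a 1792-dimensional vector per fragment, which could then be passed to a classifier such as a support vector machine. Unlike the global histogram alone, the vector changes when the same bytes appear in a different part of the fragment.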
