The impact of data fragment sizes on file type recognition

Determining the original file type of a data fragment supports data recovery, spam detection, virus scanning, and network monitoring. In many cases, only unordered fragments of the original file are available for investigation, so the file type must be identified from the content of the fragment alone. Fragments also vary in size, because they may be residual data recovered from storage media or network packets. Larger fragments are reported to be easier to classify than smaller ones [1], which makes it important to study how fragment size affects file type recognition. In this paper, we apply a machine learning technique to data fragments of different sizes and examine the results in order to find the minimum fragment size required for file type recognition.
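The experimental setup described above can be illustrated with a minimal sketch. The abstract does not name the features or the classifier, so the byte-frequency histogram used here is an assumption chosen for illustration only, and the helper names `split_into_fragments` and `byte_histogram` are hypothetical. The sketch shows how the same data can be cut into fragments of several sizes, each turned into a fixed-length feature vector that a classifier could then consume.

```python
from collections import Counter
import os


def split_into_fragments(data: bytes, fragment_size: int) -> list:
    """Cut data into consecutive fragments of fragment_size bytes.

    A trailing piece shorter than fragment_size is dropped so that every
    fragment in one experiment has exactly the same length.
    """
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    count = len(data) // fragment_size
    return [data[i * fragment_size:(i + 1) * fragment_size] for i in range(count)]


def byte_histogram(fragment: bytes) -> list:
    """Return the relative frequency of each of the 256 byte values.

    Assumption: byte frequencies are one plausible feature set; the
    abstract does not say which features the paper actually uses.
    """
    if not fragment:
        raise ValueError("fragment must not be empty")
    counts = Counter(fragment)
    total = len(fragment)
    return [counts.get(value, 0) / total for value in range(256)]


if __name__ == "__main__":
    # Random bytes stand in for real file content in this illustration.
    data = os.urandom(8192)
    for size in (256, 512, 1024, 4096):  # hypothetical fragment sizes
        fragments = split_into_fragments(data, size)
        features = [byte_histogram(f) for f in fragments]
        print(size, len(fragments), len(features[0]))
```

Each feature vector has 256 entries regardless of fragment size, so one classifier design can be trained and evaluated separately at each size and the accuracies compared.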

[1]  Ponnuthurai N. Suganthan, et al. A Novel Support Vector Machine Approach to High Entropy Data Fragment Classification, 2010, SAISMC.

[2]  Simson L. Garfinkel, et al. File Fragment Classification: The Case for Specialized Approaches, 2009, Fourth International IEEE Workshop on Systematic Approaches to Digital Forensic Engineering.

[3]  N. Shahmehri, et al. File Type Identification of Data Fragments by Their Binary Structure, 2006, IEEE Information Assurance Workshop.

[4]  Cor J. Veenman. Statistical Disk Cluster Classification for File Carving, 2007, Third International Symposium on Information Assurance and Security.

[5]  M. Chatterjee, et al. Secure E-Commerce Protocol for Purchase of e-Goods - Using Smart Card, 2007.

[6]  Stefan Axelsson, et al. The Normalised Compression Distance as a file fragment classifier, 2010, Digit. Investig.

[7]  Chih-Jen Lin, et al. LIBSVM: A library for support vector machines, 2011, TIST.

[8]  Simson L. Garfinkel, et al. Bringing science to digital forensics with standardized forensic corpora, 2009, Digit. Investig.

[9]  Yiming Yang, et al. Statistical Learning for File-Type Identification, 2011, 10th International Conference on Machine Learning and Applications and Workshops.

[10]  Colin Morris, et al. Using NLP techniques for file fragment classification, 2012, Digit. Investig.

[11]  Ke Wang, et al. Fileprints: identifying file types by n-gram analysis, 2005, Proceedings from the Sixth Annual IEEE SMC Information Assurance Workshop.

[12]  Chih-Jen Lin, et al. A Practical Guide to Support Vector Classification, 2008.

[13]  Mohammad Hossain Heydari, et al. Content based file type detection algorithms, 2003, Proceedings of the 36th Annual Hawaii International Conference on System Sciences.

[14]  Stefano Zanero, et al. File Block Classification by Support Vector Machine, 2011, Sixth International Conference on Availability, Reliability and Security.

[15]  Sergey Bratus, et al. Automated mapping of large binary objects using primitive fragment type classification, 2010, Digit. Investig.

[16]  Vassil Roussev, et al. File fragment encoding classification - An empirical approach, 2013, Digit. Investig.