File Fragment Classification using Content Based Analysis

A major component of digital forensics is the extraction of files from a criminal's hard drives. Several techniques are used for this, one of which is file carving. File carvers are used when the system metadata or the file table is damaged but the contents of the hard drive are still intact. They work on the raw fragments on the disk and reconstruct files by classifying the fragments and then reassembling them into complete files. The classification of file fragments is therefore an important problem in digital forensics. Prior work on this problem has mainly relied on finding specific byte sequences in the file header and footer. However, classification based on the header and footer is not reliable, as they may be modified or missing. The goal of this project is to present a machine learning-based approach to content-based analysis that recognizes the file types of file fragments. The approach trains a feed-forward neural network on a 2-byte sequence histogram feature vector calculated for each file. The files are obtained from Govdocs1, a publicly available file corpus. The results show that content-based analysis is more reliable than relying on the header and footer data of files.
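As an illustration of the feature described above, the following is a minimal sketch of a 2-byte sequence histogram. The abstract does not specify how the histogram is built, so several details here are assumptions: the function name `bigram_histogram` is hypothetical, byte pairs are taken with overlap (a stride of one byte), and the counts are normalized to relative frequencies so that inputs of different lengths are comparable.

```python
def bigram_histogram(data: bytes, normalize: bool = True) -> list[float]:
    """Return a 65,536-bin histogram of 2-byte sequences in `data`.

    Assumptions (not stated in the abstract): pairs overlap, and counts
    are divided by the number of pairs when `normalize` is True.
    """
    hist = [0.0] * (256 * 256)
    # Each overlapping pair (b0, b1) maps to bin index b0 * 256 + b1.
    for b0, b1 in zip(data, data[1:]):
        hist[b0 * 256 + b1] += 1.0
    total = len(data) - 1  # number of overlapping pairs
    if normalize and total > 0:
        hist = [count / total for count in hist]
    return hist


# Hypothetical usage on a short byte string standing in for a file.
features = bigram_histogram(b"%PDF-1.4 obj endobj")
print(len(features))
```

The resulting vector has one entry per possible byte pair (256 × 256 = 65,536), which would serve as the fixed-size input layer of a feed-forward classifier regardless of the input length.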
