Statistical Disk Cluster Classification for File Carving

File carving is the process of recovering files from a disk without the help of a file system. In forensics, it is a helpful tool for finding hidden or recently removed disk content. Known signatures in file headers and footers are especially useful for carving such files out, that is, extracting everything from header to footer. However, this approach assumes that file clusters remain in order. When files are fragmented, their clusters can be disconnected and even stored out of order, so that straightforward carving fails. In this paper, we focus on methods for classifying clusters into file types by using the statistics of the clusters. Because we deliberately do not exploit any embedded signatures, the resulting evidence comes from an independent source and can later be combined with signature-based evidence. We propose a set of characteristic features and use statistical pattern recognition to learn a supervised classification model for a range of relevant file types. We also exploit the statistics of a restricted number of neighboring clusters (context) to improve classification performance. The experiments show that the proposed features indeed enable the differentiation of clusters into file types. Moreover, for some file types the incorporation of cluster context improves recognition performance significantly.
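
The idea of per-cluster statistics with neighbor context can be illustrated with a minimal Python sketch. It is not the feature set of the paper, which the abstract does not specify; the two features (byte entropy and printable-ASCII fraction), the 4096-byte cluster size, the `radius` parameter, and all function names are assumptions chosen for illustration only.

```python
import math
from collections import Counter

CLUSTER_SIZE = 4096  # assumed cluster size in bytes (illustrative)


def byte_entropy(cluster):
    """Shannon entropy of the byte histogram, in bits per byte (0 to 8)."""
    if not cluster:
        return 0.0
    n = len(cluster)
    counts = Counter(cluster)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cluster_features(cluster):
    """Two illustrative statistics: entropy and printable-ASCII fraction."""
    n = len(cluster) or 1
    printable = sum(1 for b in cluster if 32 <= b <= 126)
    return [byte_entropy(cluster), printable / n]


def context_features(clusters, index, radius=1):
    """Own features plus the mean features over a window of neighbors.

    The window spans `radius` clusters on each side and is truncated
    at the start and end of the cluster sequence.
    """
    lo = max(0, index - radius)
    hi = min(len(clusters), index + radius + 1)
    window = [cluster_features(c) for c in clusters[lo:hi]]
    mean = [sum(col) / len(window) for col in zip(*window)]
    return cluster_features(clusters[index]) + mean


def split_clusters(data, size=CLUSTER_SIZE):
    """Cut a raw byte string into fixed-size clusters."""
    return [data[i:i + size] for i in range(0, len(data), size)]
```

Feature vectors such as those returned by `context_features` could then be passed, together with known file-type labels, to any supervised classifier. High-entropy clusters (compressed or encrypted data) and low-entropy, mostly printable clusters (text) would separate well under these two features, while distinguishing between several compressed formats would likely need richer statistics.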
