A comparative study of support vector machine and neural networks for file type identification using n-gram analysis

File type identification (FTI) has become a major discipline for anti-virus developers, firewall designers, and forensic cybercrime investigators. Over the past few years, research has introduced several classifiers and features. One of these advances is n-gram analysis, which applies statistical counting to the fragments being classified. Recently, n-gram-based approaches have been successfully combined with computational intelligence classifiers. However, the academic literature offers little comprehensive explanation of machine-learning-based approaches such as neural networks (NNs) or support vector machines (SVMs), for example how input parameters such as the learning rate or the value of n for the n-grams influence the results. In addition, very few studies have compared the scalability of NN and SVM approaches. Systematic research comparing the different approaches is therefore needed to address these questions. This paper undertakes such a comparison, focusing on n-gram analysis as a feature for two classifiers: SVMs and NNs. It details our experiments with two NNs and four SVMs, using linear and RBF kernels, on RealDC datasets. In general, we found that the SVM-based approaches performed better than the NNs, but their scalability remains a challenge. © 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
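As an illustration of the kind of n-gram feature discussed in the abstract, the following minimal Python sketch counts overlapping byte n-grams in a fragment and normalizes the 1-gram counts into a fixed-length vector that a classifier such as an SVM or NN could take as input. It is not the authors' implementation: the function names (`ngram_counts`, `ngram_frequencies`, `unigram_vector`), the overlapping sliding window, and the normalization by total count are all assumptions made for the example.

```python
from collections import Counter


def ngram_counts(fragment: bytes, n: int = 2) -> Counter:
    """Count every overlapping byte n-gram in a fragment (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the fragment, one byte at a time.
    return Counter(fragment[i:i + n] for i in range(len(fragment) - n + 1))


def ngram_frequencies(fragment: bytes, n: int = 2) -> dict:
    """Turn raw n-gram counts into relative frequencies that sum to 1."""
    counts = ngram_counts(fragment, n)
    total = sum(counts.values())
    if total == 0:
        # Fragment shorter than n: no n-grams to report.
        return {}
    return {gram: count / total for gram, count in counts.items()}


def unigram_vector(fragment: bytes) -> list:
    """Fixed-length (256-entry) 1-gram frequency vector, one slot per byte value.

    Assumption: a dense vector like this is one plausible classifier input;
    the paper may construct its feature vectors differently.
    """
    freqs = ngram_frequencies(fragment, 1)
    return [freqs.get(bytes([value]), 0.0) for value in range(256)]


if __name__ == "__main__":
    sample = b"abab"
    print(ngram_counts(sample, 2))
    print(unigram_vector(sample)[ord("a")])
```

For larger n the number of possible n-grams grows as 256^n, so a dense vector quickly becomes impractical and a sparse representation such as the dictionary returned by `ngram_frequencies` is the more natural choice.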
