Improving malicious PDF classifier with feature engineering: A data-driven approach

Abstract Several approaches and tools have been developed to analyse and detect malicious content within PDF files; however, the fundamental design of these tools and techniques has not been fully considered. Existing tools are built on the available datasets and on observations made during manual maldoc analysis, which leaves them susceptible to attacks such as mimicry and parser confusion. We aim to enhance PDF maldoc classification by identifying the most conclusive feature set required to classify PDF maldocs accurately. We extract features using two popular PDF analysis tools and derive an additional, data-backed set of features that complements classification. We then evaluate all features through a wrapper function. The features with the highest importance values are used to construct a classifier that outperforms the baseline models in both classification accuracy and efficiency. The proposed method identifies a useful set of tool-independent features that can prolong the lifespan and usability of current tools, and it provides an in-depth understanding of how the chosen features cumulatively affect classification. In addition, we evaluate our findings using real-world samples from VirusTotal. Using the proposed technique, we reduce the size of the feature set by more than 60% while increasing classification accuracy by around 2%.
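
The wrapper-based evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which wrapper, classifier, or features were used, so everything below is an assumption made for illustration: the feature names are hypothetical, the data is synthetic, the classifier is a simple nearest-centroid model, and the wrapper is a greedy forward selection that keeps a feature only if it improves validation accuracy.

```python
import random

# Hypothetical feature names; the real feature set is not given in the abstract.
FEATURES = ["js_count", "open_action", "obj_count", "page_count", "noise_a", "noise_b"]


def make_data(n=400, seed=0):
    """Build synthetic samples: label 1 = malicious, label 0 = benign."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([
            rng.gauss(4.0 * label, 1.0),  # strongly informative
            rng.gauss(1.5 * label, 1.0),  # weakly informative
            rng.gauss(0.0, 1.0),          # the remaining columns carry no signal
            rng.gauss(0.0, 1.0),
            rng.gauss(0.0, 1.0),
            rng.gauss(0.0, 1.0),
        ])
        y.append(label)
    return X, y


def centroid_accuracy(X_tr, y_tr, X_va, y_va, cols):
    """Fit a nearest-centroid classifier on columns `cols`, score it on validation data."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X_tr, y_tr) if t == label]
        centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
    correct = 0
    for x, t in zip(X_va, y_va):
        dist = {
            label: sum((x[c] - m) ** 2 for c, m in zip(cols, centroids[label]))
            for label in (0, 1)
        }
        correct += min(dist, key=dist.get) == t
    return correct / len(y_va)


def forward_select(X_tr, y_tr, X_va, y_va, n_features):
    """Greedy wrapper: repeatedly add the feature that most improves validation accuracy."""
    selected, best = [], 0.0
    while len(selected) < n_features:
        candidates = [c for c in range(n_features) if c not in selected]
        scored = [
            (centroid_accuracy(X_tr, y_tr, X_va, y_va, selected + [c]), c)
            for c in candidates
        ]
        score, col = max(scored)
        if score <= best:  # stop once no candidate improves the score
            break
        selected.append(col)
        best = score
    return selected, best


X, y = make_data()
split = int(0.7 * len(X))
selected_idx, accuracy = forward_select(
    X[:split], y[:split], X[split:], y[split:], len(FEATURES)
)
selected = [FEATURES[i] for i in selected_idx]
print(selected, round(accuracy, 3))
```

Because the same validation split is used both to select features and to report accuracy, the printed score is optimistic; a separate held-out test set would be needed for an unbiased estimate.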
