Explainable Ensemble Learning Based Detection of Evasive Malicious PDF Documents

PDF has become a major attack vector for delivering malware and compromising systems and networks, owing to its popularity and widespread use across platforms. PDF provides a flexible file structure that allows different types of content to be embedded, such as JavaScript, encoded streams, images, and executable files. This enables attackers to embed malicious code and to hide its functionality within seemingly benign, non-executable documents. As a result, a large proportion of current automated detection systems are unable to effectively detect PDF files with concealed malicious content. To mitigate this problem, this paper proposes a novel approach based on ensemble learning with enhanced static features, which is used to build an explainable and robust malicious PDF document detection system. The proposed system is more resilient against reverse mimicry injection attacks than existing state-of-the-art learning-based malicious PDF detection systems. The recently released EvasivePDFMal2022 dataset was used to investigate the efficacy of the proposed system. On this dataset, an overall classification accuracy greater than 98% was observed with five ensemble learning classifiers. Furthermore, the proposed system, which employs new anomaly-based features, was evaluated on a reverse mimicry attack dataset containing three different types of content injection attacks: embedded JavaScript, embedded malicious PDF, and embedded malicious EXE. The experiments conducted on the reverse mimicry dataset showed that, with our enhanced feature set, the Random Committee ensemble learning model achieved 100% detection rates for embedded EXE and embedded JavaScript, and a 98% detection rate for embedded PDF.
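As a rough illustration of what static features and ensemble voting can look like in practice, the Python sketch below counts a few PDF name keywords in the raw bytes of a file and combines binary verdicts by majority vote. The keyword list, the function names `static_features` and `majority_vote`, and the sample document are assumptions made for illustration only; they are not the paper's enhanced feature set, its anomaly-based features, or its Random Committee model.

```python
import re

# Hypothetical keyword list chosen for illustration; the paper's actual
# enhanced feature set is not specified in this abstract.
KEYWORDS = [b"/JavaScript", b"/JS", b"/OpenAction", b"/EmbeddedFile", b"/Launch"]


def static_features(pdf_bytes: bytes) -> dict:
    """Count occurrences of each keyword in the raw bytes, without parsing or executing the file."""
    features = {}
    for kw in KEYWORDS:
        # Reject matches followed by a letter or digit, so a keyword is not
        # counted when it is only the prefix of a longer name
        # (for example /EmbeddedFile inside /EmbeddedFiles).
        pattern = re.escape(kw) + rb"(?![A-Za-z0-9])"
        features[kw.decode()] = len(re.findall(pattern, pdf_bytes))
    features["size_bytes"] = len(pdf_bytes)
    return features


def majority_vote(votes: list) -> int:
    """Combine binary votes (1 = malicious) from several base classifiers; ties count as benign."""
    return int(sum(votes) * 2 > len(votes))


# A tiny hand-written PDF-like byte string, used only to exercise the functions.
sample = (
    b"%PDF-1.7\n1 0 obj\n<< /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n%%EOF"
)
feats = static_features(sample)
print(feats)
print(majority_vote([1, 1, 0]))
```

In a real pipeline, feature vectors of this kind would be fed to trained ensemble classifiers rather than to hand-supplied votes. Plain byte matching also misses names obfuscated with hex escapes or hidden inside compressed streams, which is one reason richer structural and anomaly-based features are needed.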
