ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files

Email carrying malicious attachments or links is often used as an attack vector for the initial penetration of a targeted organization. Existing defense solutions prevent executables from entering organizational networks via email, so recent attacks tend to use non-executable files such as PDF. Machine learning algorithms have recently been applied to detecting malicious PDF files. These techniques, however, lack an essential element: they cannot be updated daily. In this study we present ALPD, a framework based on active learning (AL) methods that are specially designed to help anti-virus vendors focus their analytical efforts. It does so by identifying and acquiring new PDF files that are most likely malicious, as well as informative benign PDF documents. These files are then used to retrain and enhance the knowledge stores (the detection model and the anti-virus signature repository). Evaluation results show that on the final day of the experiment, Combination, one of our AL methods, outperformed all the others, enriching the anti-virus's signature repository with almost seven times more new PDF malware while also improving the detection model's performance on a daily basis.
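The daily acquisition step described above can be illustrated with a minimal sketch. It is not the ALPD implementation or the Combination method: the function name `select_for_analysis`, the `exploit_fraction` parameter, and the convention that a positive classifier score leans malicious are all assumptions made for illustration. Part of the analyst's daily budget goes to the files scored as most likely malicious, and the rest goes to the files the classifier is least certain about.

```python
from typing import Dict, List


def select_for_analysis(
    scores: Dict[str, float],
    budget: int,
    exploit_fraction: float = 0.5,
) -> List[str]:
    """Pick up to `budget` file ids to send to a human analyst.

    `scores` maps a file id to a signed classifier score. Assumption:
    a score above 0 leans malicious, and the magnitude reflects confidence.
    """
    if budget <= 0:
        return []

    # Share of the budget reserved for the most likely malicious files.
    n_exploit = int(budget * exploit_fraction)

    # Most likely malicious: largest positive scores first.
    by_maliciousness = sorted(
        (f for f, s in scores.items() if s > 0),
        key=lambda f: scores[f],
        reverse=True,
    )
    chosen = by_maliciousness[:n_exploit]

    # Most informative: closest to the decision boundary, skipping files
    # that were already chosen above.
    chosen_set = set(chosen)
    by_uncertainty = sorted(
        (f for f in scores if f not in chosen_set),
        key=lambda f: abs(scores[f]),
    )
    chosen += by_uncertainty[: budget - len(chosen)]
    return chosen


# Hypothetical scores for one day's stream of unlabeled PDF files.
daily_scores = {
    "a.pdf": 2.5,
    "b.pdf": 0.1,
    "c.pdf": -0.05,
    "d.pdf": -3.0,
    "e.pdf": 1.7,
}
picked = select_for_analysis(daily_scores, budget=4)
```

In a full loop, the analyst's labels for the selected files would be added to the training set and the classifier retrained before the next day's stream is scored.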
