SFEM: Structural feature extraction methodology for the detection of malicious office documents using machine learning methods

Abstract. Office documents are used extensively by individuals and organizations. Most users consider these documents safe to use. Unfortunately, Office documents can contain malicious components and perform harmful operations. Attackers increasingly take advantage of naive users and leverage Office documents to launch sophisticated advanced persistent threat (APT) and ransomware attacks. Recently, targeted cyber-attacks against organizations have been initiated with emails containing malicious attachments. Since most email servers do not allow executable files to be attached to emails, attackers prefer to use non-executable files (e.g., documents) for malicious purposes. Existing anti-virus engines primarily use signature-based detection methods and therefore fail to detect new, unknown malicious code embedded in an Office document. Machine learning methods have been shown to be effective at detecting known and unknown malware in various domains; however, to the best of our knowledge, machine learning methods have not been used for the detection of malicious XML-based Office documents (*.docx, *.xlsx, *.pptx, *.odt, *.ods, etc.). In this paper we present a novel structural feature extraction methodology (SFEM) for XML-based Office documents. SFEM extracts discriminative features from documents, based on their structure. We leveraged SFEM's features with machine learning algorithms for effective detection of malicious *.docx documents. We extensively evaluated SFEM with machine learning classifiers using a representative collection (16,938 *.docx documents collected "from the wild") which contains ∼4.9% malicious and ∼95.1% benign documents. We examined 1,600 unique configurations based on different combinations of feature extraction, feature selection, feature representation, top-feature selection methods, and machine learning classifiers.
The results show that machine learning algorithms trained on features provided by SFEM successfully detect new, unknown malicious *.docx documents. The Random Forest classifier achieves the highest detection rates, with an AUC of 99.12% and a true positive rate (TPR) of 97% at a false positive rate (FPR) of 4.9%. In comparison, the best anti-virus engine achieves a TPR that is ∼25% lower.
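
To make the idea of structure-based features concrete, the following is a minimal sketch and not the SFEM implementation itself. It assumes that a structural feature is simply a directory or file path inside the ZIP container of a *.docx document, and that each document is represented as a binary vector over the paths observed in a collection. The function names (`structural_paths`, `to_binary_vector`, `make_toy_document`) and the toy archive contents are hypothetical and serve illustration only; an embedded object is not in itself evidence of malice.

```python
import io
import zipfile


def structural_paths(docx_bytes: bytes) -> set[str]:
    """Return every directory and file path found inside a ZIP-based document."""
    paths = set()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            parts = [p for p in name.split("/") if p]
            # Add each ancestor directory of the entry (with a trailing slash).
            for depth in range(1, len(parts)):
                paths.add("/".join(parts[:depth]) + "/")
            # Add the entry itself, keeping the slash for directory entries.
            if parts:
                suffix = "/" if name.endswith("/") else ""
                paths.add("/".join(parts) + suffix)
    return paths


def to_binary_vector(paths: set[str], vocabulary: list[str]) -> list[int]:
    """Mark with 1 every vocabulary path that is present in the document."""
    return [1 if feature in paths else 0 for feature in vocabulary]


def make_toy_document(entries: dict[str, str]) -> bytes:
    """Build an in-memory ZIP archive standing in for a *.docx file (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in entries.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# Two hypothetical documents; the second carries an embedded binary object.
plain_doc = make_toy_document({
    "[Content_Types].xml": "<Types/>",
    "word/document.xml": "<document/>",
})
doc_with_embedding = make_toy_document({
    "[Content_Types].xml": "<Types/>",
    "word/document.xml": "<document/>",
    "word/embeddings/oleObject1.bin": "placeholder",
})

# The vocabulary is the sorted union of all paths seen in the collection.
vocabulary = sorted(structural_paths(plain_doc) | structural_paths(doc_with_embedding))
plain_vector = to_binary_vector(structural_paths(plain_doc), vocabulary)
embedding_vector = to_binary_vector(structural_paths(doc_with_embedding), vocabulary)
```

Vectors of this kind could then be passed through feature selection and handed to a classifier such as Random Forest. The sketch stops at the ZIP level for brevity and does not descend into the XML files themselves.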
