Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems

Advances in machine learning algorithms have improved the performance of malware detection systems over the last decade. However, challenges remain, such as processing large volumes of malware, learning from high-dimensional vectors, high storage usage, and low scalability in learning. This paper proposes low-dimensional but effective features for a malware detection system and analyzes them with tree-based ensemble models. Expert knowledge and frequency analysis are used to select relevant features from the collected data set, which enables fast preparation of low-dimensional features, low storage usage, and fast learning. We extract five types of malware features from binary or disassembly files. Specifically, the novel WEM (Window Entropy Map) image is designed to represent malware of variable length, and the set of frequently used APIs is analyzed to shorten the processing time. To validate the effectiveness of the selected features, we compare the performance of tree-based ensemble models such as AdaBoost, XGBoost, random forest, extra trees, and rotation trees. The proposed features reduce the original feature dimensionality by a factor of several tens to hundreds and shorten the training time of ensemble models without degrading the malware detection rate relative to the full set of malware features. In both accuracy and AUC-PRC, XGBoost ranks highest.
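
The abstract does not define how the WEM image is built, so the following is only a minimal sketch of the general idea of a window-based entropy representation, not the paper's construction. It assumes non-overlapping byte windows, Shannon entropy per window, and simple averaging into a fixed number of bins so that files of different lengths map to vectors of the same size. The function names, the window size of 256 bytes, and the bin count of 16 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (range 0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total) for n in Counter(chunk).values())


def window_entropies(data: bytes, window: int = 256) -> List[float]:
    """Entropy of each non-overlapping window; the last window may be shorter."""
    return [shannon_entropy(data[i:i + window]) for i in range(0, len(data), window)]


def fixed_length_profile(data: bytes, bins: int = 16, window: int = 256) -> List[float]:
    """Average the window entropies into `bins` buckets (assumed scheme, not the paper's)."""
    values = window_entropies(data, window)
    if not values:
        return [0.0] * bins
    profile = []
    for b in range(bins):
        start = b * len(values) // bins
        # Guarantee at least one value per bucket when there are fewer windows than bins.
        end = max(start + 1, (b + 1) * len(values) // bins)
        bucket = values[start:end]
        profile.append(sum(bucket) / len(bucket))
    return profile
```

Under these assumptions, a short file and a long file both yield a vector of length `bins`, which is the property a variable-length representation needs before it can be fed to a tree-based ensemble. Packed or encrypted regions would tend to show up as windows with entropy close to 8 bits per byte.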
