A comparison of feature extraction techniques for malware analysis

The rapid growth of malware in recent years has prompted extensive research on malware analysis and detection, drawing on theories from a wide variety of scientific domains. Machine learning algorithms have been explored in particular, and many feature extraction methods have been proposed in the literature for representing malware as feature vectors suitable for such algorithms. In this paper we compare several feature extraction techniques by applying them to system call logs of real malware and evaluating the resulting features with a random forest classifier. In our experiment, the HMM-based feature extraction method outperformed the other methods, obtaining an F-measure of 0.87. We also explored ensembles of feature extraction methods and found that combining HMM-based features with bigram frequency features improved the F-measure by 1.7%.
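As an illustration of two notions used above, the following minimal sketch shows one plausible way to turn a system call trace into bigram frequency features and to compute the F-measure from predicted labels. It is not the implementation used in the paper: the function names, the example trace, and the call vocabulary are all hypothetical, and relative frequencies are assumed where raw counts would be equally reasonable.

```python
from collections import Counter
from itertools import product


def bigram_frequencies(calls, vocabulary):
    """Map a system call sequence to a fixed-length vector of relative
    bigram frequencies (assumed representation, one entry per ordered
    pair of calls in the vocabulary)."""
    counts = Counter(zip(calls, calls[1:]))
    # The total includes every adjacent pair in the trace, even pairs
    # containing calls that are missing from the vocabulary.
    total = sum(counts.values())
    return [counts[pair] / total if total else 0.0
            for pair in product(vocabulary, repeat=2)]


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical trace and vocabulary, for illustration only.
trace = ["open", "read", "read", "write", "close"]
vocab = ["close", "open", "read", "write"]
features = bigram_frequencies(trace, vocab)
```

Vectors of this form could be concatenated with features from another extraction method before being passed to a classifier, which is one simple way to realize the kind of ensemble mentioned above.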
