HDM-Analyser: a hybrid analysis approach based on data mining techniques for malware detection

Today’s security threats like malware are more sophisticated and targeted than ever, and they are growing at an unprecedented rate. To deal with them, various approaches are introduced. One of them is Signature-based detection, which is an effective method and widely used to detect malware; however, there is a substantial problem in detecting new instances. In other words, it is solely useful for the second malware attack. Due to the rapid proliferation of malware and the desperate need for human effort to extract some kinds of signature, this approach is a tedious solution; thus, an intelligent malware detection system is required to deal with new malware threats. Most of intelligent detection systems utilise some data mining techniques in order to distinguish malware from sane programs. One of the pivotal phases of these systems is extracting features from malware samples and benign ones in order to make at least a learning model. This phase is called “Malware Analysis” which plays a significant role in these systems. Since API call sequence is an effective feature for realising unknown malware, this paper is focused on extracting this feature from executable files. There are two major kinds of approach to analyse an executable file. The first type of analysis is “Static Analysis” which analyses a program in source code level. The second one is “Dynamic Analysis” that extracts features by observing program’s activities such as system requests during its execution time. Static analysis has to traverse the program’s execution path in order to find called APIs. Because it does not have sufficient information about decision making points in the given executable file, it is not able to extract the real sequence of called APIs. Although dynamic analysis does not have this drawback, it suffers from execution overhead. Thus, the feature extraction phase takes noticeable time. In this paper, a novel hybrid approach, HDM-Analyser, is presented which takes advantages of dynamic and static analysis methods for rising speed while preserving the accuracy in a reasonable level. HDM-Analyser is able to predict the majority of decision making points by utilising the statistical information which is gathered by dynamic analysis; therefore, there is no execution overhead. The main contribution of this paper is taking accuracy advantage of the dynamic analysis and incorporating it into static analysis in order to augment the accuracy of static analysis. In fact, the execution overhead has been tolerated in learning phase; thus, it does not impose on feature extraction phase which is performed in scanning operation. The experimental results demonstrate that HDM-Analyser attains better overall accuracy and time complexity than static and dynamic analysis methods.

[1]  Mourad Debbabi,et al.  Detection of Malicious Code in Cots Software: A Short Survey , 1999 .

[2]  John Platt,et al.  Fast training of svms using sequential minimal optimization , 1998 .

[3]  R. Sekar,et al.  On Preventing Intrusions by Process Behavior Monitoring , 1999, Workshop on Intrusion Detection and Network Monitoring.

[4]  John G. Cleary,et al.  K*: An Instance-based Learner Using and Entropic Distance Measure , 1995, ICML.

[5]  Ian H. Witten,et al.  WEKA: a machine learning workbench , 1994, Proceedings of ANZIIS '94 - Australian New Zealnd Intelligent Information Systems Conference.

[6]  Nir Friedman,et al.  Bayesian Network Classifiers , 1997, Machine Learning.

[7]  Jules Desharnais,et al.  Static Detection of Malicious Code in Executable Programs , 2000 .

[8]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[9]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[10]  Pat Langley,et al.  Induction of One-Level Decision Trees , 1992, ML.

[11]  U. Bayer,et al.  TTAnalyze: A Tool for Analyzing Malware , 2006 .

[12]  Aditya P. Mathur,et al.  A Survey of Malware Detection Techniques , 2007 .

[13]  Vlado Keselj,et al.  Detection of New Malicious Code Using N-grams Signatures , 2004, PST.

[14]  Barton P. Miller,et al.  Hybrid Analysis and Control of Malware , 2010, RAID.

[15]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[16]  Peter Szor,et al.  The Art of Computer Virus Research and Defense , 2005 .

[17]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[18]  Evangelos P. Markatos,et al.  Combining static and dynamic analysis for the detection of malicious documents , 2011, EUROSEC '11.

[19]  Thomas G. Dietterich An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization , 2000, Machine Learning.

[20]  R. Dennis Cook,et al.  Cross-Validation of Regression Models , 1984 .

[21]  Jesse C. Rabek,et al.  Detection of injected, dynamically generated, and obfuscated malicious code , 2003, WORM '03.

[22]  G DietterichThomas An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees , 2000 .

[23]  Yanfang Ye,et al.  IMDS: intelligent malware detection system , 2007, KDD '07.

[24]  A.H. Sung,et al.  Polymorphic malicious executable scanner by API sequence analysis , 2004, Fourth International Conference on Hybrid Intelligent Systems (HIS'04).

[25]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization, advances in kernel methods , 1999 .

[26]  Andrew H. Sung,et al.  Static analyzer of vicious executables (SAVE) , 2004, 20th Annual Computer Security Applications Conference.

[27]  Vlado Keselj,et al.  N-gram-based detection of new malicious code , 2004, Proceedings of the 28th Annual International Computer Software and Applications Conference, 2004. COMPSAC 2004..

[28]  Pat Langley,et al.  An Analysis of Bayesian Classifiers , 1992, AAAI.

[29]  Morgan C. Wang,et al.  Data mining methods for malware detection , 2008 .

[30]  David A. Wagner,et al.  Intrusion detection via static analysis , 2001, Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.

[31]  David Orenstein,et al.  QuickStudy: Application Programming Interface (API) , 2000 .