Comparative Study of Supervised Learning Methods for Malware Analysis

Malware is software designed to disrupt or damage a computer system or to perform other unwanted actions. Today, malware is a common threat on the World Wide Web. Anti-malware protection and intrusion detection can be significantly supported by comprehensive analysis of data collected on the Web. The aim of such analysis is to classify the collected data into two sets: normal and malicious. In this paper the authors investigate the use of three supervised learning methods for data mining to support malware detection. The paper describes the results of applying Support Vector Machine, Naive Bayes, and k-Nearest Neighbors techniques to the classification of data taken from devices located in many units, organizations, and monitoring systems serviced by CERT Poland. The performance of all methods is compared and discussed. The results of the experiments show that supervised learning algorithms can be successfully applied to computer data analysis and can support computer emergency response teams in threat detection.

Keywords: data classification, k-Nearest Neighbors, malware analysis, Naive Bayes, Support Vector Machine.
