SBMDS: an interpretable string based malware detection system using SVM ensemble with bagging

Malicious executables are programs designed to infiltrate or damage a computer system without the owner’s consent, and they have become a serious threat to the security of computer systems. There is an urgent need for effective techniques to detect polymorphic, metamorphic, and previously unseen malicious executables, which most commercial anti-virus software fails to detect. In this paper, we develop an interpretable string based malware detection system (SBMDS), which is based on interpretable string analysis and uses a support vector machine (SVM) ensemble with bagging to classify file samples and predict the exact type of the malware. Interpretable strings include both application programming interface (API) execution calls and important semantic strings that reflect an attacker’s intent and goal. SBMDS works in four major steps: (1) constructing the interpretable strings by developing a feature parser; (2) performing feature selection to select informative strings related to different types of malware; (3) using an SVM ensemble with bagging to construct the classifier; and (4) building the malware detector, which not only detects whether a program is malicious but also predicts the exact type of the malware. Our case study on a large collection of file samples collected by the Kingsoft Anti-virus lab illustrates that: (1) the accuracy and efficiency of SBMDS exceed those of several popular anti-virus software products; (2) based on the signatures of interpretable strings, SBMDS outperforms data mining based detection systems that employ a single SVM, Naive Bayes with bagging, or Decision Trees with bagging; and (3) SBMDS achieves better performance than IMDS, which uses objective-oriented association (OOA) based classification on API calls. SBMDS has already been incorporated into the scanning tool of a commercial anti-virus software product.
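
The four steps can be illustrated with a minimal sketch. It is not the SBMDS implementation: the abstract does not specify the parser, the feature selection criterion, the SVM kernel, or the aggregation rule, so everything below is an assumption chosen for brevity. The sketch assumes strings have already been extracted from each file, uses a document-frequency difference as a stand-in selection score, trains linear SVMs with a Pegasos-style subgradient method on bootstrap replicates, and aggregates by majority vote. It covers only the malicious/benign decision, not prediction of the malware type. All function names, sample strings, and parameters (`select_strings`, `train_bagged_svms`, `lam`, `epochs`, `n_members`) are hypothetical.

```python
import random
from collections import Counter

MALWARE, BENIGN = 1, -1

def select_strings(samples, labels, k):
    """Stand-in feature selection (an assumption): keep the k strings whose
    document frequency differs most between the two classes."""
    mal, ben = Counter(), Counter()
    for strings, label in zip(samples, labels):
        (mal if label == MALWARE else ben).update(set(strings))
    score = {s: abs(mal[s] - ben[s]) for s in set(mal) | set(ben)}
    return sorted(score, key=lambda s: (-score[s], s))[:k]

def to_vector(strings, vocabulary):
    """Binary presence vector with a trailing constant 1.0 for the bias term."""
    present = set(strings)
    return [1.0 if s in present else 0.0 for s in vocabulary] + [1.0]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_linear_svm(X, y, rng, lam=0.01, epochs=200):
    """Pegasos-style stochastic subgradient descent on the regularized hinge loss."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):  # one shuffled pass
            t += 1
            eta = 1.0 / (lam * t)
            violated = y[i] * dot(w, X[i]) < 1.0  # margin check before the update
            w = [wj * (1.0 - eta * lam) for wj in w]
            if violated:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def train_bagged_svms(X, y, n_members=11, seed=0):
    """Bagging: train each SVM on a bootstrap replicate of the training set."""
    rng = random.Random(seed)
    members = []
    while len(members) < n_members:
        idx = [rng.randrange(len(X)) for _ in X]  # draw with replacement
        if len({y[i] for i in idx}) < 2:
            continue  # redraw if a class is missing from the replicate
        members.append(train_linear_svm([X[i] for i in idx], [y[i] for i in idx], rng))
    return members

def predict(members, x):
    """Majority vote over the ensemble (one possible aggregation rule)."""
    votes = sum(1 if dot(w, x) > 0 else -1 for w in members)
    return MALWARE if votes > 0 else BENIGN

# Toy, hand-written samples: each is the list of strings found in one file.
train_samples = [
    ["CreateRemoteThread", "WriteProcessMemory", "LoadLibraryA"],
    ["CreateRemoteThread", "WriteProcessMemory", "cmd.exe /c del"],
    ["CreateRemoteThread", "WriteProcessMemory", "LoadLibraryA", "RegSetValueExA"],
    ["GetOpenFileNameW", "MessageBoxW", "LoadLibraryA"],
    ["GetOpenFileNameW", "MessageBoxW", "PrintDlgW"],
    ["GetOpenFileNameW", "MessageBoxW", "LoadLibraryA", "Help"],
]
train_labels = [MALWARE] * 3 + [BENIGN] * 3

vocabulary = select_strings(train_samples, train_labels, k=4)
X = [to_vector(s, vocabulary) for s in train_samples]
members = train_bagged_svms(X, train_labels)

suspicious = ["CreateRemoteThread", "WriteProcessMemory", "URLDownloadToFileA"]
harmless = ["GetOpenFileNameW", "MessageBoxW", "SaveAs"]
pred_suspicious = predict(members, to_vector(suspicious, vocabulary))
pred_harmless = predict(members, to_vector(harmless, vocabulary))
print(vocabulary, pred_suspicious, pred_harmless)
```

Because the selected strings are ordinary API names and text fragments, the learned weights can be read directly: a positive weight on a string pushes a file toward the malicious class, which is the sense in which a string based detector is interpretable. An odd number of ensemble members avoids tied votes.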
