Acquisition of Malicious Code Using Active Learning

The recent growth in network usage has motivated the creation of new malicious code for various purposes, including economic and other malicious purposes. Currently, dozens of new malicious codes are created every day, and this number is expected to increase in coming years. Today’s signature-based anti-viruses and heuristic-based methods are accurate, but cannot detect new malicious code. Recently, classification algorithms were used successfully for the detection of malicious code. We present a complete methodology for the detection of unknown malicious code, inspired by text categorization concepts. However, this approach can be exploited further to achieve a more accurate and efficient acquisition method of unknown malicious files. We use an Active-Learning framework that enables the selection of the unknown files for fast acquisition. We performed an extensive evaluation of a test collection consisting of more than 30,000 files. We present a rigorous evaluation setup, consisting of reallife scenarios, in which the malicious file content is expected to be low, at about 10% of the files in the stream. We define specific evaluation measures based on the known precision and recall measures, which show the accuracy of the acquisition process and the improvement in the classifier resulting from the efficient acquisition process.

[1]  Daphne Koller,et al.  Support Vector Machine Active Learning with Applications to Text Classification , 2000, J. Mach. Learn. Res..

[2]  Nathalie Japkowicz,et al.  A Feature Selection and Evaluation Scheme for Computer Virus Detection , 2006, Sixth International Conference on Data Mining (ICDM'06).

[3]  Thorsten Joachims,et al.  Making large-scale support vector machine learning practical , 1999 .

[4]  Vlado Keselj,et al.  N-gram-based detection of new malicious code , 2004, Proceedings of the 28th Annual International Computer Software and Applications Conference, 2004. COMPSAC 2004..

[5]  Salvatore J. Stolfo,et al.  Data mining methods for detection of new malicious executables , 2001, Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.

[6]  Dana Angluin,et al.  Queries and concept learning , 1988, Machine Learning.

[7]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[8]  Marcus A. Maloof,et al.  Learning to Detect and Classify Malicious Executables in the Wild , 2006, J. Mach. Learn. Res..

[9]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[10]  Marcus A. Maloof,et al.  Learning to detect malicious executables in the wild , 2004, KDD.

[11]  William A. Gale,et al.  A sequential algorithm for training text classifiers , 1994, SIGIR '94.

[12]  Hari Balakrishnan,et al.  Malware prevalence in the KaZaA file-sharing network , 2006, IMC '06.

[13]  Andrew McCallum,et al.  Toward Optimal Active Learning through Sampling Estimation of Error Reduction , 2001, ICML.

[14]  Yuval Elovici,et al.  Unknown malcode detection via text categorization and the imbalance problem , 2008, 2008 IEEE International Conference on Intelligence and Security Informatics.