On concept drift, deployability, and adversarial selection in machine learning-based malware detection

Machine learning-based methods are used for malware detection because they can automatically learn detection rules from examples. Applying these methods effectively, however, requires addressing problems that arise from the adversarial nature of the malware domain. This dissertation addresses three such problems: concept drift, deployable classifier selection, and adversarial configuration of a selection-based AV system.

Concept drift results from nonstationary populations: malware populations may not be stationary because malware evolves to evade detection. Machine learning methods for malware detection nonetheless assume that the malware population is stationary, i.e. that the probability distribution of the observed characteristics (features) of the population does not change over time. We investigate this assumption by treating malware families as populations, and we propose two measures for tracking concept drift in malware families when feature sets are very large: relative temporal similarity and metafeatures. A study using these measures on 4,000+ samples from three real-world families of x86 malware, spanning over five years, shows negligible drift in mnemonic 2-grams extracted from unpacked versions of the samples.

We then propose a novel classifier selection criterion, called deployability, which explicitly takes into account the performance target that the deployed classifier is expected to meet on unseen data. The performance target, in conjunction with an interval estimate of each candidate classifier's generalization performance, can be used to select deployable classifiers. An evaluation of the criterion shows that the least-expected-cost classifier may not be deployable for a given cost target, while classifiers with higher expected cost may be deployable for the same target and confidence level.

Finally, we propose a game-theoretic model of a dynamic classifier selection-based AV system. The model takes into account possible evasion of the selector. A backward-induction equilibrium solution of the game between adversary and defender gives the configuration of classifiers in the system that minimizes the defender's expected cost. Together, the solutions to these three problems support the effective application of machine learning-based methods to malware detection.
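As a rough illustration of the relative temporal similarity idea, one could compare the mnemonic 2-gram frequency distribution of each time window of a family against a baseline window; a similarity near 1 indicates negligible drift. This is a minimal sketch under assumed details — the mnemonic sequences, the use of cosine similarity, and the function names are illustrative, not the dissertation's exact formulation:

```python
from collections import Counter
from math import sqrt

def mnemonic_2grams(mnemonics):
    """Count overlapping 2-grams of instruction mnemonics."""
    return Counter(zip(mnemonics, mnemonics[1:]))

def cosine_similarity(a, b):
    """Cosine similarity between two 2-gram count vectors."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy mnemonic sequences for three time windows of one family.
windows = [
    ["push", "mov", "call", "mov", "pop", "ret"],  # window 0 (baseline)
    ["push", "mov", "call", "mov", "pop", "ret"],  # window 1: no drift
    ["xor", "jmp", "xor", "jmp", "nop", "nop"],    # window 2: drifted
]

# Relative temporal similarity: each later window vs. the baseline.
baseline = mnemonic_2grams(windows[0])
sims = [cosine_similarity(baseline, mnemonic_2grams(w)) for w in windows[1:]]
```

Here `sims` stays near 1.0 for the unchanged window and drops to 0.0 for the window whose 2-grams no longer overlap the baseline.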
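The backward-induction solution can be sketched as a two-stage game: the defender commits to a classifier configuration, the adversary then best-responds, and the defender picks the configuration whose induced outcome minimizes its expected cost. The configurations, adversary moves, and cost values below are hypothetical placeholders, and the adversary is assumed (as an approximation) to maximize defender cost:

```python
def backward_induction(configs, moves, cost):
    """Solve the two-stage game by backward induction: for each defender
    configuration, the adversary picks the move that maximizes defender
    cost; the defender then picks the configuration with the smallest
    resulting cost."""
    induced = {c: max(cost[(c, m)] for m in moves) for c in configs}
    return min(induced, key=induced.get)

# Hypothetical defender configurations and adversary evasion strategies.
defender_configs = ["conf_A", "conf_B"]
adversary_moves = ["evade_selector", "evade_classifier"]

# Illustrative expected costs to the defender (not from the dissertation).
defender_cost = {
    ("conf_A", "evade_selector"):   0.9,
    ("conf_A", "evade_classifier"): 0.2,
    ("conf_B", "evade_selector"):   0.4,
    ("conf_B", "evade_classifier"): 0.5,
}

best = backward_induction(defender_configs, adversary_moves, defender_cost)
```

With these numbers the adversary would exploit `conf_A`'s weak selector (cost 0.9), so backward induction selects `conf_B`, whose worst-case cost (0.5) is lower.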
