Feature Extraction and Static Analysis for Large-Scale Detection of Malware Types and Families

There are several methods for identifying malware. The most widespread, found in almost every antivirus solution on the market today, is the signature-based approach. This approach uses a one-way cryptographic function to generate a unique hash of each file, and each hash is then checked against a database of hashes of known malware. This method produces close to no false positives, but it can only detect previously known malware and will therefore in many cases produce false negatives. Malware authors exploit this weakness by changing a small part of the malicious code, which changes the entire hash of the file. The malicious code then remains undetected until the sample is discovered, analyzed, and added to the vendor's database(s). Because this evasion is relatively easy for malware authors, other ways of identifying malware are clearly needed. The two other main approaches are static analysis and behavior-based (dynamic) analysis. Such analysis, and previous research on it, has primarily focused on detecting whether a file is malicious or benign (binary classification), and comprehensive work has been done in these fields over the last few years. In the work we propose, we apply machine learning methods to the results of static analysis in order to classify malicious Windows executables, not just as benign or malicious as in much prior research, but by malware family affiliation. To do this we use a database of about 330,000 malicious executables. A challenge in this work is the naming of samples and families, since different antivirus vendors label samples with different names and follow no standard naming scheme. This is exemplified by the VirusTotal online scanner, which checks a hash against 57 malware databases.
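The weakness of hash-based signatures described above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the choice of SHA-256, the function names, and the sample bytes are all assumptions made for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_known_malware(data: bytes, signature_db: set) -> bool:
    """Signature check: flag the file only if its exact hash is in the database."""
    return sha256_hex(data) in signature_db


# Hypothetical sample bytes; a real scanner would read the file from disk.
original = b"MZ\x90\x00 hypothetical malicious payload"

# Hypothetical signature database containing only the original sample.
signature_db = {sha256_hex(original)}

# The author appends a single byte, leaving the behavior unchanged.
variant = original + b"\x00"

print(is_known_malware(original, signature_db))  # True: exact hash match
print(is_known_malware(variant, signature_db))   # False: the hash no longer matches
```

The variant differs from the original by one byte, yet its hash is unrelated to the stored one, so the lookup misses it until the database is updated.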
For the static analysis we use the VirusTotal scanner as well as PEframe, an open source tool for analyzing portable executables. The work performed in the thesis presents a novel approach to extracting and constructing features that can be used to estimate which type and family a malicious file belongs to, which can be useful for analysis and for antivirus scanners. The contribution is novel because multinomial classification is applied to distinguish between different types and families.
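Inconsistent vendor naming means that a single family label has to be derived from several disagreeing scanner results before multinomial classification can be trained. The sketch below shows one possible way to do this, a majority vote over normalized labels. It is an assumption for illustration and not the thesis's method: the vendor names, label strings, alias table, and the function `consensus_family` are all hypothetical.

```python
from collections import Counter


def consensus_family(vendor_labels, aliases):
    """Pick the family named by most vendors after mapping aliases to one name.

    vendor_labels: dict of vendor name -> label string (or None if no detection)
    aliases: dict of lowercase substring -> canonical family name
    Returns the canonical family with the most votes, or None if nothing matched.
    """
    votes = Counter()
    for label in vendor_labels.values():
        if not label:
            continue  # this vendor did not flag the sample
        lowered = label.lower()
        for alias, family in aliases.items():
            if alias in lowered:
                votes[family] += 1
                break  # count at most one vote per vendor
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical scan result for one sample (illustrative strings, not real output).
labels = {
    "VendorA": "Trojan.Win32.Zbot.abc",
    "VendorB": "PWS:Win32/Zbot",
    "VendorC": "W32/Zeus.gen",
    "VendorD": None,
    "VendorE": "Backdoor.Poison",
}

# Hypothetical alias table mapping vendor-specific names to one family name.
aliases = {"zbot": "zeus", "zeus": "zeus", "poison": "poisonivy"}

print(consensus_family(labels, aliases))  # zeus
```

A simple vote like this ignores vendors that copy each other's labels, so in practice the alias table and the voting rule would need to be checked against the data.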
