Feature Extraction Using Genetic Programming with Applications in Malware Detection

This paper extends the authors' previous research on a malware detection method, focusing on improving the accuracy of the perceptron-based One Side Class Perceptron algorithm through the use of Genetic Programming. We are concerned with finding a proper balance between three basic requirements for malware detection algorithms: (a) that their training time on large datasets stays below acceptable upper limits; (b) that their false positive rate (clean, legitimate files or software wrongly classified as malware) is as close as possible to 0; and (c) that their detection rate is as close as possible to 1. When the first two requirements are set as design objectives, the third is often missed: the detection rate is low. This study focuses on improving the detection rate while preserving the short training time and the low false positive rate. Another concern is to exploit the perceptron-based algorithm's good performance on linearly separable data by extracting new features from the existing ones, seeking better separability. To keep the overall training time low, the huge search space of possible extracted features is explored efficiently, in terms of both time and memory footprint, using Genetic Programming. For the experiments we used a dataset of 350,000 executable files, each described by an initial set of 300 Boolean features. The feature-extraction algorithm is implemented in parallel in order to cope with the size of the dataset. We also tested different ways of controlling the growth in size of the variable-length chromosomes. The experimental results show that the features produced by this method are better than the best ones obtained through mapping, allowing for an increase in detection rate.
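To make the idea of extracting features with Genetic Programming concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an extracted feature is a Boolean expression tree over the original Boolean features, that fitness rewards covering malware samples while rejecting any tree that fires on a clean sample, and that chromosome growth is controlled by a simple size penalty. The names `evaluate`, `fitness`, `mutate`, `evolve`, and the parameter `size_penalty` are hypothetical; the abstract does not specify the actual operators, fitness function, or growth-control strategies. Crossover and parallel evaluation are omitted for brevity.

```python
import random

# A tree is either an int (index of an original Boolean feature) or a tuple
# ("not", child) / ("and", left, right) / ("or", left, right).

def evaluate(tree, row):
    """Compute the value of an extracted feature for one sample."""
    if isinstance(tree, int):
        return bool(row[tree])
    if tree[0] == "not":
        return not evaluate(tree[1], row)
    left, right = evaluate(tree[1], row), evaluate(tree[2], row)
    return (left and right) if tree[0] == "and" else (left or right)

def size(tree):
    """Number of nodes in the tree (the length of the chromosome)."""
    if isinstance(tree, int):
        return 1
    return 1 + sum(size(child) for child in tree[1:])

def random_tree(n_features, depth, rng):
    """Grow a random expression tree of at most the given depth."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(n_features)
    op = rng.choice(["and", "or", "not"])
    if op == "not":
        return (op, random_tree(n_features, depth - 1, rng))
    return (op, random_tree(n_features, depth - 1, rng),
            random_tree(n_features, depth - 1, rng))

def fitness(tree, rows, labels, size_penalty=0.01):
    """Assumed fitness: malware samples covered, minus a size penalty.
    Any hit on a clean sample makes the fitness negative."""
    malware_hits = sum(1 for r, y in zip(rows, labels) if y and evaluate(tree, r))
    clean_hits = sum(1 for r, y in zip(rows, labels) if not y and evaluate(tree, r))
    if clean_hits:
        return -float(clean_hits)
    return malware_hits - size_penalty * size(tree)

def mutate(tree, n_features, rng):
    """Replace a randomly chosen subtree with a freshly grown one."""
    if isinstance(tree, int) or rng.random() < 0.3:
        return random_tree(n_features, 2, rng)
    i = rng.randrange(1, len(tree))
    return tree[:i] + (mutate(tree[i], n_features, rng),) + tree[i + 1:]

def evolve(rows, labels, n_features, generations=30, pop_size=20, seed=0):
    """Elitist truncation selection with mutation only (a simplification)."""
    rng = random.Random(seed)
    score = lambda t: fitness(t, rows, labels)
    pop = [random_tree(n_features, 3, rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]  # the best half survives unchanged
        children = [mutate(rng.choice(parents), n_features, rng)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=score)

# Toy data: a sample is malware only when feature 0 is set and feature 1 is not.
rows = [(True, False), (True, True), (False, False), (False, True)]
labels = [1, 0, 0, 0]
best = evolve(rows, labels, n_features=2)
```

On the toy data, the hand-built tree `("and", 0, ("not", 1))` fires on the single malware sample and on no clean sample, whereas the original feature `0` alone also fires on a clean sample. This is the kind of improved separability that an extracted feature is meant to provide to a linear classifier.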
