Building a practical and reliable classifier for malware detection

A machine learning algorithm that can correctly classify malicious software has become a necessity, as older detection methods based on hashes and handwritten heuristics tend to fail against the intensive flow of new malware. To be practical, however, a machine learning classifier must also have a reasonable training time and a very small number of false positives, preferably zero. A few authors have addressed both of these issues, but creating such a model is more difficult when training requires more than 3 million files. By mapping a zero-false-positive perceptron into a new space, applying a feature selection algorithm, and using the resulting model in an ensemble, voting, or rule-based clustering system, we achieved a detection rate of around 99% with 0.07% false positives while keeping the training time suitable for large data sets.
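The abstract does not describe how the zero-false-positive perceptron is trained, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes one simple approach: train an ordinary perceptron, then lower the bias until no benign training sample is scored as malicious. All names (`train_perceptron`, `enforce_zero_fp`, `predict`) and the toy data are hypothetical.

```python
def score(w, b, x):
    """Linear score; positive means 'malicious' in this sketch."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_perceptron(X, y, epochs=50, lr=0.1):
    """Plain perceptron. Labels: +1 = malicious, -1 = benign."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * score(w, b, xi) <= 0:  # misclassified: update
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def enforce_zero_fp(w, b, X, y, margin=1e-6):
    """Assumed post-processing step: shift the bias down so that every
    benign training sample scores strictly below zero. This trades
    detection rate for zero false positives on the training set."""
    max_benign = max(score(w, b, xi) for xi, yi in zip(X, y) if yi == -1)
    if max_benign >= 0:
        b -= max_benign + margin
    return w, b

def predict(w, b, x):
    return 1 if score(w, b, x) > 0 else -1

# Toy binary feature vectors (hypothetical); the pattern [1, 1, 0] appears
# with both labels, so the data is not linearly separable.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 1, 1],
     [0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 0]]
y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = enforce_zero_fp(*train_perceptron(X, y), X, y)
false_positives = sum(1 for xi, yi in zip(X, y)
                      if yi == -1 and predict(w, b, xi) == 1)
```

The guarantee holds only on the training data; the 0.07% false positive rate reported above reflects that unseen benign files can still be misclassified.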
