Methodology for Behavioral-based Malware Analysis and Detection Using Random Projections and K-Nearest Neighbors Classifiers

In this paper, a two-stage methodology to analyze and detect behavioral-based malware is presented. In the first stage, a random projection is decreasing the variable dimensionality of the problem and is simultaneously reducing the computational time of the classification task by several orders of magnitude. In the second stage, a modified K-Nearest Neighbors classifier is used with Virus Total labeling of the file samples. This methodology is applied to a large number of file samples provided by F-Secure Corporation, for which a dynamic feature has been extracted during Deep Guard sandbox execution. As a result, the files classified as false negatives are used to detect possible malware that were not detected in the first place by Virus Total. The reduced number of selected false negatives allows the manual inspection by a human expert.

[1]  Alberto Guillén,et al.  OP-KNN: Method and Applications , 2010, Adv. Artif. Neural Syst..

[2]  M. Kenward,et al.  An Introduction to the Bootstrap , 2007 .

[3]  Dmitriy Fradkin,et al.  Experiments with random projections for machine learning , 2003, KDD '03.

[4]  Carla E. Brodley,et al.  Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach , 2003, ICML.

[5]  Li Sun,et al.  Pattern Recognition Techniques for the Classification of Malware Packers , 2010, ACISP.

[6]  D. Kibler,et al.  Instance-based learning algorithms , 2004, Machine Learning.

[7]  W. B. Johnson,et al.  Extensions of Lipschitz mappings into Hilbert space , 1984 .

[8]  Joris Kinable,et al.  Malware classification based on call graph clustering , 2010, Journal in Computer Virology.

[9]  Santosh S. Vempala,et al.  The Random Projection Method , 2005, DIMACS Series in Discrete Mathematics and Theoretical Computer Science.

[10]  Michel Verleysen,et al.  Model Selection with Cross-Validations and Bootstraps - Application to Time Series Prediction with RBFN Models , 2003, ICANN.

[11]  Liwei Zhang,et al.  Detecting Trojan horses based on system behavior using machine learning method , 2010, 2010 International Conference on Machine Learning and Cybernetics.

[12]  Sanjoy Dasgupta,et al.  Experiments with Random Projection , 2000, UAI.

[13]  Alva Erwin,et al.  Analysis of Machine learning Techniques Used in Behavior-Based Malware Detection , 2010, 2010 Second International Conference on Advances in Computing, Control, and Telecommunication Technologies.

[14]  Zhuoqing Morley Mao,et al.  Automated Classification and Analysis of Internet Malware , 2007, RAID.

[15]  Tsutomu Matsumoto,et al.  Vulnerability in Public Malware Sandbox Analysis Systems , 2010, 2010 10th IEEE/IPSJ International Symposium on Applications and the Internet.

[16]  Josef Kittler,et al.  Pattern recognition : a statistical approach , 1982 .

[17]  Felix C. Freiling,et al.  Toward Automated Dynamic Malware Analysis Using CWSandbox , 2007, IEEE Secur. Priv..

[18]  Dimitris Achlioptas,et al.  Database-friendly random projections: Johnson-Lindenstrauss with binary coins , 2003, J. Comput. Syst. Sci..

[19]  R. Tibshirani,et al.  Improvements on Cross-Validation: The 632+ Bootstrap Method , 1997 .

[20]  Amaury Lendasse,et al.  OP-ELM: Optimally Pruned Extreme Learning Machine , 2010, IEEE Transactions on Neural Networks.

[21]  Lior Rokach,et al.  Improving malware detection by applying multi-inducer ensemble , 2009, Comput. Stat. Data Anal..

[22]  P. Jaccard,et al.  Etude comparative de la distribution florale dans une portion des Alpes et des Jura , 1901 .

[23]  Jonathan Goldstein,et al.  When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[24]  Abhinav Srivastava,et al.  Automatic Discovery of Parasitic Malware , 2010, RAID.

[25]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..