Efficient Malware Packer Identification Using Support Vector Machines with Spectrum Kernel

Packing is among the most popular obfuscation techniques to impede anti-virus scanners from successfully detecting malware. Efficient and automatic packer identification is an essential step to perform attack on ever increasing malware databases. In this paper we present a p-spectrum induced linear Support Vector Machine to implement an automated packer identification with good accuracy and scalability. The efficacy and efficiency of the method is evaluated on a dataset composed of 3228 packed files created by 25 packers with near-perfect identification results reported. This method can help to improve the scanning efficiency of anti-virus products and ease efficient back-end malware research.

[1]  Gérard Dreyfus,et al.  Single-layer learning revisited: a stepwise procedure for building and training a neural network , 1989, NATO Neurocomputing.

[2]  Thorsten Joachims,et al.  Training linear SVMs in linear time , 2006, KDD '06.

[3]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[4]  Shanqing Guo,et al.  Application of string kernel based support vector machine for malware packer identification , 2013, The 2013 International Joint Conference on Neural Networks (IJCNN).

[5]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[6]  Anthony Widjaja,et al.  Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond , 2003, IEEE Transactions on Neural Networks.

[7]  Marcus A. Maloof,et al.  Learning to Detect and Classify Malicious Executables in the Wild , 2006, J. Mach. Learn. Res..

[8]  D. Huffman A Method for the Construction of Minimum-Redundancy Codes , 1952 .

[9]  Nello Cristianini,et al.  Kernel Methods for Pattern Analysis , 2003, ICTAI.

[10]  Li Sun,et al.  Pattern Recognition Techniques for the Classification of Malware Packers , 2010, ACISP.

[11]  Sören Sonnenburg,et al.  COFFIN: A Computational Framework for Linear SVMs , 2010, ICML.

[12]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[13]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[14]  Thomas M. Cover,et al.  Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition , 1965, IEEE Trans. Electron. Comput..

[15]  Tzi-cker Chiueh,et al.  A Study of the Packer Problem and Its Solutions , 2008, RAID.

[16]  Chih-Jen Lin,et al.  Trust region Newton methods for large-scale logistic regression , 2007, ICML '07.

[17]  Ferran Reverter,et al.  A Comparison of Spectrum Kernel Machines for Protein Subnuclear Localization , 2011, IbPRIA.

[18]  Robert Lyda,et al.  Using Entropy Analysis to Find Encrypted and Packed Malware , 2007, IEEE Security & Privacy.

[19]  Wenke Lee,et al.  Classification of packed executables for accurate computer virus detection , 2008, Pattern Recognit. Lett..