Using opcode sequences in single-class learning to detect unknown malware

Malware is any type of malicious code that has the potential to harm a computer or network. The volume of malware grows faster every year and poses a serious global security threat. Although signature-based detection is the most widespread method in commercial antivirus programs, it consistently fails to detect new malware. Supervised machine-learning models have been used to address this issue, but their use is limited because a large amount of malicious code and benign software must first be labelled. In this study, the authors propose a new method that uses single-class learning to detect unknown malware families. The method examines the frequency with which opcode sequences appear and builds a machine-learning classifier from labelled instances of only one class, either malware or legitimate software. The authors' empirical study shows that this method can reduce the effort of labelling software while maintaining high accuracy.
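
To make the idea of opcode-sequence frequencies concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each program has already been disassembled into a list of opcode mnemonics, represents a program by the relative frequency of each length-n opcode sequence, and stands in for the single-class learner with a simple centroid-and-cosine-similarity rule. The function names, the toy opcode lists, the choice of n = 2, and the 0.5 threshold are all illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def opcode_ngram_freqs(opcodes, n=2):
    """Relative frequency of each length-n opcode sequence in one program."""
    grams = [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average the frequency vectors of the single labelled class."""
    total = Counter()
    for vector in vectors:
        total.update(vector)  # Counter.update adds values key by key
    return {key: value / len(vectors) for key, value in total.items()}


def is_in_class(opcodes, profile, threshold=0.5):
    """Assumed decision rule: close enough to the labelled-class centroid."""
    return cosine(opcode_ngram_freqs(opcodes), profile) >= threshold


# Hypothetical labelled instances, all from one class (toy data).
labelled = [
    ["push", "mov", "call", "pop"],
    ["push", "mov", "call", "ret"],
]
profile = centroid([opcode_ngram_freqs(program) for program in labelled])

print(is_in_class(["push", "mov", "call", "pop"], profile))
print(is_in_class(["xor", "jmp", "xor", "jmp"], profile))
```

Only instances of one class are needed to build `profile`, which is the property that reduces labelling effort. A real system would replace the centroid rule with a proper single-class learner and tune the sequence length and threshold on held-out data.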
