Paper information - Using opcode sequences in single-class learning to detect unknown malware

Malware is any type of malicious code that has the potential to harm a computer or network. The volume of malware is growing at a faster rate every year and poses a serious global security threat. Although signature-based detection is the most widespread method used in commercial antivirus programs, it consistently fails to detect new malware. Supervised machine-learning models have been used to address this issue. However, the use of supervised learning is limited because it needs a large amount of malicious code and benign software to be labelled first. In this study, the authors propose a new method that uses single-class learning to detect unknown malware families. This method is based on examining the frequencies of the appearance of opcode sequences to build a machine-learning classifier using only one set of labelled instances within a specific class of either malware or legitimate software. The authors performed an empirical study that shows that this method can reduce the effort of labelling software while maintaining high accuracy.
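The core idea in the abstract, representing each program by the frequencies of its opcode sequences and learning from labelled examples of a single class only, can be illustrated with a minimal sketch. This is not the authors' classifier: it swaps in a simple centroid-plus-cosine-similarity rule, and the function names, the toy opcode lists, the sequence length `n=2`, and the `0.5` threshold are all hypothetical choices made for illustration.

```python
from math import sqrt


def ngram_frequencies(opcodes, n=2):
    """Relative frequency of each length-n opcode sequence in one program."""
    grams = [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]
    freqs = {}
    for gram in grams:
        freqs[gram] = freqs.get(gram, 0.0) + 1.0 / len(grams)
    return freqs


def centroid(vectors):
    """Average the sparse frequency vectors of the single labelled class."""
    mean = {}
    for vec in vectors:
        for gram, value in vec.items():
            mean[gram] = mean.get(gram, 0.0) + value / len(vectors)
    return mean


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(gram, 0.0) for gram, value in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def belongs_to_class(opcodes, class_centroid, threshold=0.5):
    """Hypothetical decision rule: accept if similar enough to the centroid."""
    return cosine(ngram_frequencies(opcodes), class_centroid) >= threshold


# Toy labelled set: only legitimate software is labelled (assumed data).
labelled_benign = [
    ["push", "mov", "call", "pop", "ret"],
    ["push", "mov", "add", "pop", "ret"],
]
benign_centroid = centroid([ngram_frequencies(p) for p in labelled_benign])

familiar = ["push", "mov", "call", "pop", "ret"]
unfamiliar = ["xor", "jmp", "xor", "jmp", "xor"]
print(belongs_to_class(familiar, benign_centroid))
print(belongs_to_class(unfamiliar, benign_centroid))
```

Only one class is labelled here, so a program is flagged when it falls outside the region that class occupies. Labelling malware instead of legitimate software would simply invert the interpretation of the same rule.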
