Semi-supervised Learning for Unknown Malware Detection

Malware is any kind of computer software potentially harmful to both computers and networks. The amount of malware is increasing every year and poses a serious global security threat. Signature-based detection is the most widely used commercial antivirus method, however, it consistently fails to detect new malware. Supervised machine-learning models have been used to solve this issue, but the usefulness of supervised learning is far to be perfect because it requires that a significant amount of malicious code and benign software to be identified and labelled beforehand. In this paper, we propose a new method of malware protection that adopts a semi-supervised learning approach to detect unknown malware. This method is designed to build a machine-learning classifier using a set of labelled (malware and legitimate software) and unlabelled instances.We performed an empirical validation demonstrating that the labelling efforts are lower than when supervised learning is used, while maintaining high accuracy rates.

[1]  Yoseba K. Penya,et al.  N-grams-based File Signatures for Malware Detection , 2009, ICEIS.

[2]  Robert E. Schapire,et al.  The Boosting Approach to Machine Learning An Overview , 2003 .

[3]  Úlfar Erlingsson,et al.  Engineering Secure Software and Systems , 2011, Lecture Notes in Computer Science.

[4]  Yoseba K. Penya,et al.  Idea: Opcode-Sequence-Based Malware Detection , 2010, ESSoS.

[5]  Alexander Zien,et al.  Semi-Supervised Learning , 2006 .

[6]  Vinod Yegneswaran,et al.  Eureka: A Framework for Enabling Static Malware Analysis , 2008, ESORICS.

[7]  Marcus A. Maloof,et al.  Learning to detect malicious executables in the wild , 2004, KDD.

[8]  Gunter Ollmann The evolution of commercial malware development kits and colour-by-numbers custom malware , 2008 .

[9]  Muhammad Zubair Shafiq,et al.  Embedded Malware Detection Using Markov n-Grams , 2008, DIMVA.

[10]  Pavel Laskov,et al.  Detection of Intrusions and Malware, and Vulnerability Assessment: 19th International Conference, DIMVA 2022, Cagliari, Italy, June 29 –July 1, 2022, Proceedings , 2022, International Conference on Detection of intrusions and malware, and vulnerability assessment.

[11]  Bernhard Schölkopf,et al.  Introduction to Semi-Supervised Learning , 2006, Semi-Supervised Learning.

[12]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.

[13]  Wenke Lee,et al.  PolyUnpack: Automating the Hidden-Code Extraction of Unpack-Executing Malware , 2006, 2006 22nd Annual Computer Security Applications Conference (ACSAC'06).

[14]  Heng Yin,et al.  Renovo: a hidden code extractor for packed executables , 2007, WORM '07.

[15]  Arvinder Kaur,et al.  Comparative analysis of regression and machine learning methods for predicting fault proneness models , 2009, Int. J. Comput. Appl. Technol..

[16]  Martín Abadi,et al.  Code-Carrying Authorization , 2008, ESORICS.

[17]  Salvatore J. Stolfo,et al.  Data mining methods for detection of new malicious executables , 2001, Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.

[18]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[19]  Yuval Elovici,et al.  Unknown malcode detection via text categorization and the imbalance problem , 2008, 2008 IEEE International Conference on Intelligence and Security Informatics.

[20]  Stephen R. Garner,et al.  WEKA: The Waikato Environment for Knowledge Analysis , 1996 .

[21]  Somesh Jha,et al.  OmniUnpack: Fast, Generic, and Safe Unpacking of Malware , 2007, Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007).

[22]  Yan Zhou,et al.  Malware detection using adaptive data compression , 2008, AISec '08.