An Optimized Positive-Unlabeled Learning Method for Detecting a Large Scale of Malware Variants

Malicious software (malware) can quickly evolve into many different variants that evade existing detection mechanisms, rendering traditional signature-based malware detection systems ineffective. Many researchers have proposed advanced malware detection techniques based on machine learning. Although these techniques perform well in detecting a wide range of malware variants, problems remain when they are deployed in real industrial settings. Because the volume of new malware variants grows quickly and labeling data is expensive and labor-intensive, companies cannot label every sample. They tend to label a small portion of the malware samples and treat the remaining unlabeled samples as benign, so any malware among the unlabeled samples is effectively mislabeled. This biases the decision boundary and severely limits accuracy. To address this problem, we propose a cost-sensitive boosting method that trains an unbiased detection model from malicious and unlabeled executables and thereby improves accuracy. In addition, to detect malware variants efficiently, we propose a byte co-occurrence matrix as a representation of the byte streams of executables, which allows malware variants to be detected directly. Experimental results show that machine learning methods optimized by our approach achieve 80% to 90% accuracy, whereas the original machine learning methods achieve only 50% to 85% accuracy, when the unlabeled data contain different rates of mislabeled positive data.
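
To make the byte co-occurrence representation more concrete, the following is a minimal sketch in Python. The abstract does not give the exact definition, so this assumes the simplest reading: entry (a, b) counts how often byte value b immediately follows byte value a in the executable's byte stream. The function name `byte_cooccurrence_matrix`, the adjacency offset of one byte, and the optional normalization are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def byte_cooccurrence_matrix(data: bytes, normalize: bool = True) -> np.ndarray:
    """Return a 256x256 matrix of adjacent-byte pair counts (hypothetical helper).

    Assumption: "co-occurrence" means byte b directly follows byte a.
    The paper may use a different offset or weighting.
    """
    matrix = np.zeros((256, 256), dtype=np.float64)
    if len(data) < 2:
        return matrix  # no adjacent pairs to count

    # Interpret the raw bytes as integers usable for indexing.
    values = np.frombuffer(data, dtype=np.uint8).astype(np.intp)

    # np.add.at accumulates correctly when the same (a, b) pair repeats;
    # plain fancy-index assignment would count each pair only once.
    np.add.at(matrix, (values[:-1], values[1:]), 1.0)

    if normalize:
        matrix /= matrix.sum()  # turn counts into relative frequencies
    return matrix


# Toy byte stream; the pair (0x4D, 0x5A) occurs twice.
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x4D, 0x5A])
counts = byte_cooccurrence_matrix(sample, normalize=False)
freqs = byte_cooccurrence_matrix(sample)
```

Under this assumption every executable maps to a fixed-size 256 x 256 matrix regardless of file length, which is one plausible reason such a representation could suit large-scale variant detection.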
