Mobile Malware Detection with Imbalanced Data using a Novel Synthetic Oversampling Strategy and Deep Learning

Mobile malware detection is inherently an imbalanced data problem, since benign applications in the market far outnumber malicious ones. Existing methods for handling imbalanced data, such as synthetic minority over-sampling, do not translate well to this domain because they are designed for continuous features, whereas mobile malware detection generally deals with binary features. Methods adapted for categorical features cannot be applied either, since random modification of features can produce invalid samples. In this work, we propose a novel technique for generating synthetic samples for mobile malware detection with imbalanced data. The proposed method adds new data points to the sample space by generating synthetic malware samples that preserve the original functionality of the malicious apps. Experiments show that the proposed approach outperforms existing techniques in terms of precision, recall, F1 score, and AUC. This study will be useful in building deep neural network-based systems that handle imbalanced data for mobile malware detection.
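The mismatch between interpolation-based oversampling and binary features can be illustrated with a minimal sketch. It is not the method proposed in this work; it only shows why a SMOTE-style step (a point placed on the line segment between a minority sample and one of its neighbors) leaves the binary feature space. The feature vectors, the function name `smote_interpolate`, and the fixed seed are all hypothetical and chosen for illustration.

```python
import random

def smote_interpolate(x, neighbor, rng):
    """Return one SMOTE-style point: x + gap * (neighbor - x), with gap drawn from [0, 1)."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]

# Hypothetical binary feature vectors for two malware samples
# (for example, 1 could mean "permission or API call is present").
malware_a = [1, 0, 1, 1, 0]
malware_b = [1, 1, 0, 1, 0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic = smote_interpolate(malware_a, malware_b, rng)

# Positions where the two parents disagree receive fractional values.
differing = [i for i, (a, b) in enumerate(zip(malware_a, malware_b)) if a != b]
is_binary = all(v in (0, 1) for v in synthetic)

print("synthetic sample:", synthetic)
print("positions with fractional values:", differing)
print("still a valid binary vector:", is_binary)
```

Rounding the fractional entries back to 0 or 1 would restore a binary vector, but the resulting feature combination need not correspond to any working application, which is the validity concern raised above.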
