Android malware detection with weak ground truth data

For Android malware detection, precise ground truth is a rare commodity. As security knowledge evolves, what is considered ground truth at one moment may change, and apps once considered benign may turn out to be malicious. The inevitable noise in data labels poses a challenge to creating effective machine learning models. Our work focuses on learning classifiers for Android malware detection in a manner that is methodologically sound with regard to the uncertain and ever-changing ground truth in the problem space. We leverage the fact that although data labels are unavoidably noisy, a malware label is much more precise than a benign label: while one can be confident that an app labeled malicious is in fact malicious, one can never be certain whether an app labeled benign is truly benign or is simply undetected malware. Based on this insight, we use a modified Logistic Regression classifier, Label Regularized Logistic Regression, that learns from only positive and unlabeled data, without making any assumptions about benign labels. We find that Label Regularized Logistic Regression performs well on noisy app datasets and on datasets with a limited amount of positive labeled data, both of which are representative of real-world situations.
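The abstract does not spell out the training objective, so the following is only a minimal sketch of what learning from positive and unlabeled apps with a label-regularization term could look like. It assumes the general expectation-regularization form: a log-likelihood term on the labeled malware plus a penalty, KL(prior || model), that pulls the model's average predicted malware rate on the unlabeled apps toward an expected fraction. The function names, the `prior_pos` parameter, the hyperparameters (`lam`, `l2`, `step`, `n_iter`), the use of averages rather than sums, and plain gradient descent are all assumptions made for illustration, not details taken from the paper.

```python
import math


def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, x):
    # w holds one weight per feature followed by a bias term.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])


def fit_label_regularized_lr(X_pos, X_unl, prior_pos, lam=10.0, l2=0.01,
                             step=0.1, n_iter=1000):
    """Fit a logistic model from labeled malware (X_pos) and unlabeled apps.

    Minimizes, by gradient descent (all choices here are assumptions):
      -mean(log p(malware | x)) over X_pos
      + lam * KL(prior_pos || mean predicted malware rate on X_unl)
      + (l2 / 2) * ||weights||^2   (bias not penalized)
    No benign labels are used anywhere.
    """
    d = len(X_pos[0])
    w = [0.0] * (d + 1)
    for _ in range(n_iter):
        # Start from the gradient of the L2 penalty (zero for the bias).
        grad = [l2 * wi for wi in w[:-1]] + [0.0]

        # Labeled malware: gradient of the mean negative log-likelihood.
        for x in X_pos:
            r = (1.0 - predict_proba(w, x)) / len(X_pos)
            for j in range(d):
                grad[j] -= r * x[j]
            grad[d] -= r

        # Unlabeled apps: model's average predicted malware rate.
        probs = [predict_proba(w, x) for x in X_unl]
        p_hat = min(max(sum(probs) / len(probs), 1e-6), 1.0 - 1e-6)
        # Derivative of KL(prior_pos || p_hat) with respect to p_hat.
        dkl = -prior_pos / p_hat + (1.0 - prior_pos) / (1.0 - p_hat)
        # Chain rule through p_hat = mean(sigmoid(w . x)).
        for x, p in zip(X_unl, probs):
            r = lam * dkl * p * (1.0 - p) / len(X_unl)
            for j in range(d):
                grad[j] += r * x[j]
            grad[d] += r

        w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w
```

In this sketch, `X_pos` and `X_unl` are lists of numeric feature vectors, and `predict_proba(w, x)` returns the estimated probability that app `x` is malware. The sketch is written in plain Python so that it has no dependencies; a practical implementation would likely vectorize the loops and use a more robust optimizer.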
