LUNA: Quantifying and Leveraging Uncertainty in Android Malware Analysis through Bayesian Machine Learning

Android's growing popularity seems to be hindered only by the amount of malware surfacing for this open platform. Machine learning algorithms have been used successfully to detect the rapidly growing number of malware families that appear daily. Existing solutions along these lines, however, share a common limitation: they are all based on classical statistical inference and thus ignore the uncertainty inherent in any prediction task. In this paper, we show that ignoring this uncertainty leads to misclassification of both benign and malicious apps. To reduce these errors, we use Bayesian machine learning, an alternative paradigm based on Bayesian statistical inference that preserves uncertainty through every step of the calculation. We move from a black-box to a white-box approach to identify the effects that different features (such as sensitive resource usage and declared activities, services, and intent filters) have on the classification of an app. We show that incorporating uncertainty in the learning pipeline reduces incorrect decisions and significantly improves classification accuracy. We achieve a false positive rate of 0.2%, compared to the previous best of 1%. We present sufficient detail for the reader to reproduce our results with openly available probabilistic programming tools and to extend our techniques well beyond the boundaries of this paper.
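To make the role of uncertainty concrete, the following is a minimal sketch and not the model used in the paper. It assumes a single hypothetical binary feature and a conjugate Beta-Bernoulli model, and every function name, the uniform prior, and the `max_width` cutoff are illustrative assumptions. Two features with the same point estimate (75% of the apps that use them are malicious) lead to different decisions once the width of the posterior credible interval is taken into account.

```python
import random


def posterior_params(malicious, benign, prior_a=1.0, prior_b=1.0):
    """Beta posterior over P(malicious | feature present).

    Assumes a Beta(prior_a, prior_b) prior (uniform by default) and
    Bernoulli observations; both are illustrative choices.
    """
    return prior_a + malicious, prior_b + benign


def credible_interval(a, b, level=0.95, draws=20000, seed=0):
    """Approximate an equal-tailed credible interval by Monte Carlo sampling."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lower = samples[int((1 - level) / 2 * draws)]
    upper = samples[int((1 + level) / 2 * draws) - 1]
    return lower, upper


def decide(a, b, threshold=0.5, max_width=0.3):
    """Classify only when the posterior is narrow enough; otherwise abstain.

    threshold and max_width are hypothetical tuning knobs.
    """
    lower, upper = credible_interval(a, b)
    if upper - lower > max_width:
        return "uncertain"
    posterior_mean = a / (a + b)
    return "malicious" if posterior_mean > threshold else "benign"


# Same observed ratio (75% malicious), very different amounts of evidence.
few = posterior_params(malicious=3, benign=1)
many = posterior_params(malicious=300, benign=100)
print(decide(*few))   # wide interval: the sketch abstains
print(decide(*many))  # narrow interval: the sketch commits to a label
```

A classifier that keeps only the point estimate would treat both features identically; keeping the full posterior lets the pipeline defer low-evidence cases instead of risking a false positive.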
