Prescience: Probabilistic Guidance on the Retraining Conundrum for Malware Detection

Malware evolves perpetually and relies on increasingly sophisticated attacks to evade defense strategies. Data-driven approaches to malware detection therefore run the risk of rapidly becoming antiquated. Keeping pace with malware requires models that are periodically enriched with fresh knowledge, a process commonly known as retraining. In this work, we propose the use of Venn-Abers predictors for assessing the quality of binary classification tasks as a first step towards identifying antiquated models. A key benefit of Venn-Abers predictors is that they are automatically well calibrated and offer probabilistic guidance on the identification of nonstationary populations of malware. Our framework is agnostic to the underlying classification algorithm and can be used to build better retraining strategies in the presence of concept drift. Results obtained over a timeline-based evaluation with about 90K samples show that our framework can identify when models tend to become obsolete.
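To make the idea of a Venn-Abers predictor concrete, the following is a minimal sketch of the inductive variant for binary classification, not the implementation used in this work. It assumes the underlying classifier has already produced a real-valued score for each sample in a held-out calibration set. The function names `isotonic_fit` and `venn_abers`, and the toy scores and labels, are hypothetical and chosen only for illustration. For a test score, the predictor fits an isotonic (non-decreasing) regression twice, once with the test sample tentatively labeled 0 and once labeled 1, and reports the two fitted values as a probability pair (p0, p1).

```python
from itertools import groupby


def isotonic_fit(scores, labels):
    """Pool-adjacent-violators fit of labels against scores.

    Returns a dict mapping each distinct score to its fitted value.
    """
    points = sorted(zip(scores, labels))
    blocks = []  # each block: [label_sum, count, scores_in_block]
    for score, group in groupby(points, key=lambda p: p[0]):
        ys = [y for _, y in group]
        blocks.append([sum(ys), len(ys), [score]])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count, members = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] += members
    return {s: total / count for total, count, members in blocks for s in members}


def venn_abers(cal_scores, cal_labels, test_score):
    """Return (p0, p1): calibrated probabilities of label 1 for test_score."""
    pair = []
    for hypothetical_label in (0, 1):
        fit = isotonic_fit(list(cal_scores) + [test_score],
                           list(cal_labels) + [hypothetical_label])
        pair.append(fit[test_score])
    return pair[0], pair[1]


# Toy calibration set (illustrative values only): classifier scores and true labels.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 0, 1, 0, 1, 1, 1]

p0, p1 = venn_abers(cal_scores, cal_labels, 0.85)
# One common way to merge the pair into a single probability.
p_merged = p1 / (1 - p0 + p1)
print(p0, p1, p_merged)
```

The width p1 - p0 gives a rough indication of how much the calibration data supports the prediction. Because only the scores are used, the sketch is independent of the classifier that produced them, in line with the algorithm-agnostic framing above.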
