Machine Learning-Based Malware Detection for Android Applications: History Matters!

Machine Learning-based malware detection is a promising scalable method for identifying suspicious applications. In particular, in today’s mobile computing realm where thousands of applications are daily poured into markets, such a technique could be valuable to guarantee a strong filtering of malicious apps. The success of machine-learning approaches however is highly dependent on (1) the quality of the datasets that are used for training and of (2) the appropriateness of the tested datasets with regards to the built classifiers. Unfortunately, there is scarce mention of these aspects in the evaluation of existing state-of-the-art approaches in the literature. In this paper, we consider the relevance of history in the construction of datasets, to highlight its impact on the performance of the malware detection scheme. Typically, we show that simply picking a random set of known malware to train a malware detector, as it is done in most assessment scenarios from the literature, yields significantly biased results. In the process of assessing the extent of this impact through various experiments, we were also able to confirm a number of intuitive assumptions about Android malware. For instance, we discuss the existence of Android malware lineages and how they could impact the performance of malware detection in the wild.

[1]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[2]  Yoseba K. Penya,et al.  N-grams-based File Signatures for Malware Detection , 2009, ICEIS.

[3]  Sakir Sezer,et al.  A New Android Malware Detection Approach Using Bayesian Classification , 2013, 2013 IEEE 27th International Conference on Advanced Information Networking and Applications (AINA).

[4]  Yves Le Traon,et al.  Automatically securing permission-based software by reducing the attack surface: an application to Android , 2012, 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering.

[5]  Axelle Apvrille,et al.  Reducing the window of opportunity for Android malware Gotta catch ’em all , 2012, Journal in Computer Virology.

[6]  Lior Rokach,et al.  Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features , 2012, J. Mach. Learn. Res..

[7]  David Lo,et al.  Empirical Evaluation of Bug Linking , 2013, 2013 17th European Conference on Software Maintenance and Reengineering.

[8]  Jianping Yin,et al.  Malicious Codes Detection Based on Ensemble Learning , 2007, ATC.

[9]  Mary Jean Harrold,et al.  Empirical evaluation of the tarantula automatic fault-localization technique , 2005, ASE.

[10]  Swarat Chaudhuri,et al.  A Study of Android Application Security , 2011, USENIX Security Symposium.

[11]  Heng Yin,et al.  DroidAPIMiner: Mining API-Level Features for Robust Malware Detection in Android , 2013, SecureComm.

[12]  T. Moore Challenges in empirical security research , 2012 .

[13]  Aditya P. Mathur,et al.  A Survey of Malware Detection Techniques , 2007 .

[14]  Heloise Pieterse,et al.  Android botnets on the rise: Trends and characteristics , 2012, 2012 Information Security for South Africa.

[15]  Wenke Lee,et al.  McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables , 2008, 2008 Annual Computer Security Applications Conference (ACSAC).

[16]  Jeffrey O. Kephart,et al.  A biologically inspired immune system for computers , 1994 .

[17]  Ninghui Li,et al.  Using probabilistic generative models for ranking risks of Android apps , 2012, CCS.

[18]  Salvatore J. Stolfo,et al.  On the feasibility of online malware detection with performance counters , 2013, ISCA.

[19]  Yang Xiang,et al.  Classification of malware using structured control flow , 2010 .

[20]  Gerardo Canfora,et al.  A Classifier of Malicious Android Applications , 2013, 2013 International Conference on Availability, Reliability and Security.

[21]  M. Chuah,et al.  Smartphone Dual Defense Protection Framework: Detecting Malicious Applications in Android Markets , 2012, 2012 8th International Conference on Mobile Ad-hoc and Sensor Networks (MSN).

[22]  Thomas J. Ostrand,et al.  Experiments on the effectiveness of dataflow- and control-flow-based test adequacy criteria , 1994, Proceedings of 16th International Conference on Software Engineering.

[23]  Patrick Traynor,et al.  MAST: triage for market-scale mobile malware analysis , 2013, WiSec '13.

[24]  Nathalie Japkowicz,et al.  A Feature Selection and Evaluation Scheme for Computer Virus Detection , 2006, Sixth International Conference on Data Mining (ICDM'06).

[25]  Paul C. van Oorschot,et al.  A methodology for empirical analysis of permission-based security models and its application to android , 2010, CCS '10.

[26]  Hahn-Ming Lee,et al.  DroidMat: Android Malware Detection through Manifest and API Calls Tracing , 2012, 2012 Seventh Asia Joint Conference on Information Security.

[27]  Ethem Alpaydin,et al.  Introduction to machine learning , 2004, Adaptive computation and machine learning.

[28]  Salvatore J. Stolfo,et al.  Data mining methods for detection of new malicious executables , 2001, Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001.

[29]  William W. Cohen Fast Effective Rule Induction , 1995, ICML.

[30]  Wenke Lee,et al.  Classification of packed executables for accurate computer virus detection , 2008, Pattern Recognit. Lett..

[31]  Gerardo Canfora,et al.  An Empirical Study of Metric-Based Methods to Detect Obfuscated Code , 2013 .

[32]  Christos Faloutsos,et al.  Polonium: Tera-Scale Graph Mining for Malware Detection , 2013 .

[33]  Konrad Rieck,et al.  Structural detection of android malware using embedded call graphs , 2013, AISec.

[34]  Konrad Rieck,et al.  DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket , 2014, NDSS.

[35]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[36]  Jules White,et al.  Applying machine learning classifiers to dynamic Android malware detection at scale , 2013, 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC).

[37]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[38]  Latifur Khan,et al.  A Machine Learning Approach to Android Malware Detection , 2012, 2012 European Intelligence and Security Informatics Conference.

[39]  Jacques Klein,et al.  Large-scale machine learning-based malware detection: confronting the "10-fold cross validation" scheme with reality , 2014, CODASPY '14.

[40]  Marcus A. Maloof,et al.  Learning to Detect and Classify Malicious Executables in the Wild , 2006, J. Mach. Learn. Res..