KronoDroid: Time-based Hybrid-featured Dataset for Effective Android Malware Detection and Characterization

Abstract: Available datasets have neglected the evolution of Android malware, providing only a static snapshot of a non-stationary phenomenon. Android malware research has paid little attention to the time variable and has overlooked its degenerative effect on the performance of machine learning-based classifiers (i.e., concept drift). The sources of dynamic data (i.e., real devices and emulators) and their particularities have also been overlooked. Both are critical factors when aiming to build more effective, robust, and long-lasting Android malware detection systems. In this research, different sources of benign and malware data are merged to generate a dataset that spans a larger time frame, and 489 static and dynamic features are collected. The particularities of the source of the dynamic features (i.e., system calls) are addressed by using both an emulator and a real device, thus generating two equally featured sub-datasets. The main outcome of this research is a novel, labeled, hybrid-featured Android dataset that provides a timestamp for each data sample, covers all years of Android history from 2008 to 2020, and considers the distinct dynamic data sources. The emulator dataset is composed of 28,745 malicious apps from 209 malware families and 35,246 benign samples. The real device dataset contains 41,382 malware samples, belonging to 240 malware families, and 36,755 benign apps. Made publicly available as KronoDroid in a structured format, it is the largest hybrid-featured Android dataset and the only one that provides timestamped data, considers the particularities of dynamic sources, and includes samples for over 209 Android malware families.
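
Per-sample timestamps make it possible to evaluate a detector chronologically rather than on a random split, which is one common way to expose concept drift. The following is a minimal sketch of such a time-aware split, not the authors' procedure: the records, the `temporal_split` helper, and the cutoff date are hypothetical and do not reflect the actual KronoDroid file layout or column names.

```python
from datetime import date

# Hypothetical records: (timestamp, label), where 1 = malware and 0 = benign.
# These values are illustrative only and are not taken from KronoDroid.
samples = [
    (date(2018, 11, 2), 1),
    (date(2010, 5, 1), 0),
    (date(2020, 1, 20), 1),
    (date(2012, 3, 14), 1),
    (date(2015, 7, 9), 0),
]


def temporal_split(records, cutoff):
    """Split records so that training data strictly precedes the cutoff date.

    Returns (train, test), each sorted by timestamp. Training on older
    samples and testing on newer ones avoids leaking future knowledge
    into the model.
    """
    ordered = sorted(records, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test


train, test = temporal_split(samples, date(2016, 1, 1))
print(len(train), "training samples;", len(test), "test samples")
```

Comparing performance on such a chronological split against a random split of the same data gives a rough indication of how much a classifier degrades over time.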
