Android Malware Familial Classification and Representative Sample Selection via Frequent Subgraph Analysis

The rapid increase in the number of Android malware poses great challenges to anti-malware systems, because the sheer number of malware samples overwhelms malware analysis systems. The classification of malware samples into families, such that the common features shared by malware samples in the same family can be exploited in malware detection and inspection, is a promising approach for accelerating malware analysis. Furthermore, the selection of representative malware samples in each family can drastically decrease the number of malware to be analyzed. However, the existing classification solutions are limited because of the following reasons. First, the legitimate part of the malware may misguide the classification algorithms because the majority of Android malware are constructed by inserting malicious components into popular apps. Second, the polymorphic variants of Android malware can evade detection by employing transformation attacks. In this paper, we propose a novel approach that constructs frequent subgraphs (fregraphs) to represent the common behaviors of malware samples that belong to the same family. Moreover, we propose and develop FalDroid, a novel system that automatically classifies Android malware and selects representative malware samples in accordance with fregraphs. We apply it to 8407 malware samples from 36 families. Experimental results show that FalDroid can correctly classify 94.2% of malware samples into their families using approximately 4.6 sec per app. FalDroid can also dramatically reduce the cost of malware investigation by selecting only 8.5% to 22% representative samples that exhibit the most common malicious behavior among all samples.

[1]  Peter N. Yianilos,et al.  Learning String-Edit Distance , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[2]  Kam-Fai Wong,et al.  Interpreting TF-IDF term weights as making relevance decisions , 2008, TOIS.

[3]  Mu Zhang,et al.  Semantics-Aware Android Malware Classification Using Weighted Contextual API Dependency Graphs , 2014, CCS.

[4]  Qinghua Zheng,et al.  Exploiting thread-related system calls for plagiarism detection of multithreaded programs , 2016, J. Syst. Softw..

[5]  Chao Yang,et al.  DroidMiner: Automated Mining and Characterization of Fine-grained Malicious Behaviors in Android Applications , 2014, ESORICS.

[6]  Jacques Klein,et al.  DroidRA: taming reflection to support whole-program analysis of Android apps , 2016, ISSTA.

[7]  Martin Rosvall,et al.  Maps of random walks on complex networks reveal community structure , 2007, Proceedings of the National Academy of Sciences.

[8]  Hao Chen,et al.  AnDarwin: Scalable Detection of Semantically Similar Android Applications , 2013, ESORICS.

[9]  Qinghua Zheng,et al.  Software Plagiarism Detection with Birthmarks Based on Dynamic Key Instruction Sequences , 2015, IEEE Transactions on Software Engineering.

[10]  Somesh Jha,et al.  Synthesizing Near-Optimal Malware Specifications from Suspicious Behaviors , 2010, 2010 IEEE Symposium on Security and Privacy.

[11]  Isil Dillig,et al.  Apposcopy: semantics-based detection of Android malware through static analysis , 2014, SIGSOFT FSE.

[12]  Eric Bodden,et al.  A Machine-learning Approach for Classifying and Categorizing Android Sources and Sinks , 2014, NDSS.

[13]  Haoyu Wang,et al.  WuKong: a scalable and accurate two-phase approach to Android app clone detection , 2015, ISSTA.

[14]  Christopher Krügel,et al.  SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis , 2016, 2016 IEEE Symposium on Security and Privacy (SP).

[15]  Jacques Klein,et al.  FlowDroid: precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps , 2014, PLDI.

[16]  Qinghua Zheng,et al.  Reviving Sequential Program Birthmarking for Multithreaded Software Plagiarism Detection , 2018, IEEE Transactions on Software Engineering.

[17]  M. Newman,et al.  Finding community structure in very large networks. , 2004, Physical review. E, Statistical, nonlinear, and soft matter physics.

[18]  Kang G. Shin,et al.  Large-scale malware indexing using function-call graphs , 2009, CCS.

[19]  Hao Chen,et al.  Attack of the Clones: Detecting Cloned Applications on Android Markets , 2012, ESORICS.

[20]  Peng Wang,et al.  Finding Unknown Malice in 10 Seconds: Mass Vetting for New Threats at the Google-Play Scale , 2015, USENIX Security Symposium.

[21]  Alessandra Gorla,et al.  Mining Apps for Abnormal Usage of Sensitive Data , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[22]  Yuan Zhang,et al.  Vetting undesirable behaviors in android apps with permission use analysis , 2013, CCS.

[23]  Sencun Zhu,et al.  ViewDroid: towards obfuscation-resilient mobile application repackaging detection , 2014, WiSec '14.

[24]  Peng Wang,et al.  AsDroid: detecting stealthy behaviors in Android applications by user interface and program behavior contradiction , 2014, ICSE.

[25]  Yajin Zhou,et al.  Fast, scalable detection of "Piggybacked" mobile applications , 2013, CODASPY.

[26]  Joris Kinable,et al.  Malware classification based on call graph clustering , 2010, Journal in Computer Virology.

[27]  David W. Aha,et al.  Instance-Based Learning Algorithms , 1991, Machine Learning.

[28]  Juan E. Tapiador,et al.  TriFlow: Triaging Android Applications using Speculative Information Flows , 2017, AsiaCCS.

[29]  Réka Albert,et al.  Near linear time algorithm to detect community structures in large-scale networks. , 2007, Physical review. E, Statistical, nonlinear, and soft matter physics.

[30]  Yuan-Cheng Lai,et al.  Identifying android malicious repackaged applications by thread-grained system call sequences , 2013, Comput. Secur..

[31]  Ilia Nouretdinov,et al.  Transcend: Detecting Concept Drift in Malware Classification Models , 2017, USENIX Security Symposium.

[32]  Lei Xue,et al.  Adaptive Unpacking of Android Apps , 2017, 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE).

[33]  Marcus A. Maloof,et al.  Learning to Detect and Classify Malicious Executables in the Wild , 2006, J. Mach. Learn. Res..

[34]  Julian R. Ullmann,et al.  An Algorithm for Subgraph Isomorphism , 1976, J. ACM.

[35]  Qinghua Zheng,et al.  Exploring community structure of software Call Graph and its applications in class cohesion measurement , 2015, J. Syst. Softw..

[36]  Isil Dillig,et al.  Automated Synthesis of Semantic Malware Signatures using Maximum Satisfiability , 2016, NDSS.

[37]  Danai Koutra,et al.  RolX: structural role extraction & mining in large graphs , 2012, KDD.

[38]  Xin Sun,et al.  Detecting Code Reuse in Android Applications Using Component-Based Control Flow Graph , 2014, SEC.

[39]  Peng Liu,et al.  Achieving accuracy and scalability simultaneously in detecting application clones on Android markets , 2014, ICSE.

[40]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[41]  Aristide Fattori,et al.  CopperDroid: Automatic Reconstruction of Android Malware Behaviors , 2015, NDSS.

[42]  Juan E. Tapiador,et al.  Dendroid: A text mining approach to analyzing and classifying code structures in Android malware families , 2014, Expert Syst. Appl..

[43]  M E J Newman,et al.  Finding and evaluating community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[44]  Giulio Concas,et al.  A study of the community structure of a complex software network , 2013, 2013 4th International Workshop on Emerging Trends in Software Metrics (WETSoM).

[45]  Xuxian Jiang,et al.  Catch Me If You Can: Evaluating Android Anti-Malware Against Transformation Attacks , 2014, IEEE Transactions on Information Forensics and Security.

[46]  Qinghua Zheng,et al.  Dependence Guided Symbolic Execution , 2017, IEEE Transactions on Software Engineering.

[47]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[48]  Xiapu Luo,et al.  DexHunter: Toward Extracting Hidden Code from Packed Android Applications , 2015, ESORICS.

[49]  Arun Lakhotia,et al.  DroidLegacy: Automated Familial Classification of Android Malware , 2014, PPREW'14.

[50]  Lei Zhang,et al.  Towards a scalable resource-driven approach for detecting repackaged Android applications , 2014, ACSAC.

[51]  Konrad Rieck,et al.  Structural detection of android malware using embedded call graphs , 2013, AISec.

[52]  Konrad Rieck,et al.  DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket , 2014, NDSS.

[53]  Dawn Xiaodong Song,et al.  Contextual Policy Enforcement in Android Applications with Permission Event Graphs , 2013, NDSS.

[54]  Xiapu Luo,et al.  On Tracking Information Flows through JNI in Android Applications , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[55]  Yajin Zhou,et al.  Dissecting Android Malware: Characterization and Evolution , 2012, 2012 IEEE Symposium on Security and Privacy.

[56]  Qinghua Zheng,et al.  Frequent Subgraph Based Familial Classification of Android Malware , 2016, 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE).

[57]  Steven Salzberg,et al.  Programs for Machine Learning , 2004 .

[58]  Ming Fan,et al.  DAPASA: Detecting Android Piggybacked Apps Through Sensitive Subgraph Analysis , 2017, IEEE Transactions on Information Forensics and Security.

[59]  Jean-Loup Guillaume,et al.  Fast unfolding of communities in large networks , 2008, 0803.0476.