Semi-supervised classification for dynamic Android malware detection

A growing number of threats to Android phones creates challenges for malware detection. Manually labeling the samples into benign or different malicious families requires tremendous human efforts, while it is comparably easy and cheap to obtain a large amount of unlabeled APKs from various sources. Moreover, the fast-paced evolution of Android malware continuously generates derivative malware families. These families often contain new signatures, which can escape detection when using static analysis. These practical challenges can also cause traditional supervised machine learning algorithms to degrade in performance. In this paper, we propose a framework that uses model-based semi-supervised (MBSS) classification scheme on the dynamic Android API call logs. The semi-supervised approach efficiently uses the labeled and unlabeled APKs to estimate a finite mixture model of Gaussian distributions via conditional expectation-maximization and efficiently detects malwares during out-of-sample testing. We compare MBSS with the popular malware detection classifiers such as support vector machine (SVM), $k$-nearest neighbor (kNN) and linear discriminant analysis (LDA). Under the ideal classification setting, MBSS has competitive performance with 98\% accuracy and very low false positive rate for in-sample classification. For out-of-sample testing, the out-of-sample test data exhibit similar behavior of retrieving phone information and sending to the network, compared with in-sample training set. When this similarity is strong, MBSS and SVM with linear kernel maintain 90\% detection rate while $k$NN and LDA suffer great performance degradation. When this similarity is slightly weaker, all classifiers degrade in performance, but MBSS still performs significantly better than other classifiers.

[1]  Gérard Govaert,et al.  Gaussian parsimonious clustering models , 1995, Pattern Recognit..

[2]  T. Moon The expectation-maximization algorithm , 1996, IEEE Signal Process. Mag..

[3]  A. Raftery,et al.  Detecting features in spatial point processes with clutter via model-based clustering , 1998 .

[4]  A. Raftery Bayes Factors and BIC , 1999 .

[5]  Christopher Krügel,et al.  Going Native: Using a Large-Scale Analysis of Android Apps to Create a Practical Native-Code Sandboxing Policy , 2016, NDSS.

[6]  Alex Pentland,et al.  Maximum Conditional Likelihood via Bound Maximization and the CEM Algorithm , 1998, NIPS.

[7]  Wenke Lee,et al.  CHEX: statically vetting Android apps for component hijacking vulnerabilities , 2012, CCS.

[8]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[9]  N. Dean,et al.  Using unlabelled data to update classification rules with applications in food authenticity studies , 2006 .

[10]  David Lie,et al.  IntelliDroid: A Targeted Input Generator for the Dynamic Analysis of Android Malware , 2016, NDSS.

[11]  Heng Yin,et al.  DroidAPIMiner: Mining API-Level Features for Robust Malware Detection in Android , 2013, SecureComm.

[12]  Li Chen,et al.  Stochastic Blockmodeling for Online Advertising , 2014, AAAI.

[13]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[14]  J. Doug Tygar,et al.  Adversarial machine learning , 2019, AISec '11.

[15]  Insik Shin,et al.  FLEXDROID: Enforcing In-App Privilege Separation in Android , 2016, NDSS.

[16]  Li Chen,et al.  Spectral clustering for divide-and-conquer graph matching , 2013, Parallel Comput..

[17]  G. Celeux,et al.  Regularized Gaussian Discriminant Analysis through Eigenvalue Decomposition , 1996 .

[18]  Geoffrey J. McLachlan,et al.  Finite Mixture Models , 2019, Annual Review of Statistics and Its Application.

[19]  Jun Sun,et al.  Auditing Anti-Malware Tools by Evolving Android Malware and Dynamic Loading Technique , 2017, IEEE Transactions on Information Forensics and Security.

[20]  S. C. Johnson Hierarchical clustering schemes , 1967, Psychometrika.

[21]  Adrian E. Raftery,et al.  Model-Based Clustering, Discriminant Analysis, and Density Estimation , 2002 .