ClassifyDroid: Large scale Android applications classification using semi-supervised Multinomial Naive Bayes

Rapid advances in the mobile internet have brought mobile applications into the era of "Big Data" (large datasets). Classification of large-scale Android applications has attracted great interest from both researchers and practitioners. However, most existing approaches are supervised learning methods that need large amounts of labeled data. Their use in practice is often limited by the lack of labeled data, the sheer number of Android applications, or the high cost of manual labeling. In this paper, we present ClassifyDroid, a novel tool for classifying large-scale Android applications using the semi-supervised Multinomial Naive Bayes (SMNB) algorithm. Our proposed model exploits the SMNB algorithm, which is widely used in text document analysis. The approach is based on the analysis of characteristic application program interfaces (APIs), which can be seen as equivalent to the words and keywords in a text document. Namely, each application is represented as a vector of the characteristic APIs it contains, together with their frequencies. We evaluated ClassifyDroid on 15590 samples chosen from the mobile market (MM) App Store. Our experiments show that ClassifyDroid is both accurate and practical, and that it achieves better classification results than the MNB algorithm when the dataset contains few labeled applications and many unlabeled applications.
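
The idea of treating API calls as words and combining a few labeled applications with many unlabeled ones can be illustrated with a minimal sketch. The abstract does not say which SMNB variant ClassifyDroid uses, so the code below assumes the common EM-based formulation: train MNB on the labeled data, softly label the unlabeled data, and refit. The function names, the API vocabulary, the category indices, and the toy counts are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def api_vector(api_calls, vocab):
    """Count how often each characteristic API in vocab occurs in one app."""
    counts = Counter(api_calls)
    return [counts[a] for a in vocab]

def fit_mnb(X, resp, n_classes, alpha=1.0):
    """Fit MNB from per-app class weights (hard 0/1 or soft), Laplace-smoothed."""
    n_feat = len(X[0])
    mass = [sum(r[c] for r in resp) for c in range(n_classes)]
    total = sum(mass)
    log_prior = [math.log((mass[c] + alpha) / (total + alpha * n_classes))
                 for c in range(n_classes)]
    log_lik = []
    for c in range(n_classes):
        feat = [alpha + sum(r[c] * x[j] for x, r in zip(X, resp))
                for j in range(n_feat)]
        z = sum(feat)
        log_lik.append([math.log(f / z) for f in feat])
    return log_prior, log_lik

def posterior(model, x):
    """Class probabilities for one API-frequency vector."""
    log_prior, log_lik = model
    scores = [lp + sum(xj * l for xj, l in zip(x, ll))
              for lp, ll in zip(log_prior, log_lik)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def fit_smnb(X_lab, y_lab, X_unl, n_classes, n_iter=10):
    """Assumed EM-style semi-supervised MNB (not necessarily the paper's variant)."""
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in y_lab]
    model = fit_mnb(X_lab, hard, n_classes)
    for _ in range(n_iter):
        soft = [posterior(model, x) for x in X_unl]             # E-step
        model = fit_mnb(X_lab + X_unl, hard + soft, n_classes)  # M-step
    return model

def predict(model, x):
    """Index of the most probable category."""
    p = posterior(model, x)
    return p.index(max(p))

# Hypothetical vocabulary and toy data: category 0 = navigation, 1 = media.
vocab = ["getLastKnownLocation", "MediaPlayer.start", "Camera.open"]
X_lab = [api_vector(["getLastKnownLocation"] * 3, vocab),
         api_vector(["MediaPlayer.start"] * 3, vocab)]
y_lab = [0, 1]
X_unl = [[4, 0, 1], [0, 5, 1], [3, 1, 0], [1, 4, 0]]
model = fit_smnb(X_lab, y_lab, X_unl, n_classes=2)
```

With only one labeled application per category, the unlabeled vectors still shape the per-class API distributions through the soft labels, which is the situation in which the abstract reports a benefit over plain MNB.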
