Fine-Grained Mobile Application Clustering Model Using Retrofitted Document Embedding

In this paper, we propose a fine-grained mobile application clustering model that uses retrofitted document embedding. To determine the clusters and their number automatically, without predefined categories, the proposed model initializes clusters from title keywords and then merges similar clusters. To improve clustering performance, the model separates an accurate clustering step based on titles from an expansive clustering step based on descriptions. The accurate clustering step yields an automatically tagged set, which is used to learn a high-performance document vector. In the expansive clustering step, this document vector is then used to classify additional applications. Experimental results showed that, compared with the K-means algorithm, the purity of the proposed model increased by 0.19 and the entropy decreased by 1.18. In addition, the mean average precision improved by more than 0.09 over a support vector machine classifier.
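
The abstract reports purity and entropy without defining them. As a minimal sketch, the block below computes both under their common textbook definitions: purity as the fraction of items that carry the majority label of their cluster, and entropy as the size-weighted average of the label entropy within each cluster. The base-2 logarithm, the function names, and the toy data are assumptions made for illustration; the paper may define or normalize these measures differently.

```python
from collections import Counter
from math import log2


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    total = sum(len(members) for members in clusters)
    majority = sum(
        max(Counter(labels[item] for item in members).values())
        for members in clusters
        if members
    )
    return majority / total


def entropy(clusters, labels):
    """Size-weighted average of the label entropy inside each cluster.

    Assumption: base-2 logarithm; the source does not state the base.
    """
    total = sum(len(members) for members in clusters)
    weighted = 0.0
    for members in clusters:
        if not members:
            continue
        counts = Counter(labels[item] for item in members)
        size = len(members)
        h = -sum((c / size) * log2(c / size) for c in counts.values())
        weighted += (size / total) * h
    return weighted


# Hypothetical toy data: application ids mapped to gold categories.
labels = {"a1": "game", "a2": "game", "a3": "music", "a4": "music"}
clean = [["a1", "a2"], ["a3", "a4"]]   # each cluster holds one category
mixed = [["a1", "a3"], ["a2", "a4"]]   # each cluster is an even split

print(purity(clean, labels), entropy(clean, labels))
print(purity(mixed, labels), entropy(mixed, labels))
```

Higher purity and lower entropy both indicate clusters that align more closely with the gold categories, which is the direction of the improvements reported over K-means.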
