Enriching iTunes App Store Categories via Topic Modeling

Mobile application development is an emerging, lucrative, and fast-growing market. As the number of apps in the repositories grows steadily, providers will inevitably need to refine the existing hierarchy of categories used to organize the apps. In this paper we present a method to bootstrap the categorization process via topic modeling. We apply Latent Dirichlet Allocation (LDA) to the textual descriptions of iTunes apps in order to identify recurrent topics in the collection. We evaluate and discuss the results obtained from training the model on a set of almost 600,000 English-language app descriptions. Our results demonstrate that automated categorization via LDA-based topic modeling is a promising approach that can help to structure, analyze, and manage the content of app repositories. The topics produced complement the original iTunes categories, making them more concrete and extending them by providing insights into the underlying category content.
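To make the idea of fitting LDA to app descriptions concrete, the following is a minimal sketch of a collapsed Gibbs sampler written with the Python standard library only. It is not the pipeline used in the paper: the four toy descriptions, the stop-word list, the hyperparameters (`alpha`, `beta`, `num_topics`), and all function names are assumptions chosen for illustration, and a real corpus of this size would call for an optimized topic modeling library.

```python
import random
import re
from collections import Counter

# Hypothetical toy descriptions standing in for real app store text.
descriptions = [
    "Track your daily run, log workouts and reach your fitness goals",
    "Fitness coach with workouts, run plans and daily training goals",
    "Edit photos with filters, crop images and share your camera shots",
    "Camera app with live filters to edit and share photos and images",
]
STOPWORDS = {"your", "and", "with", "to", "the", "a"}  # assumed, minimal list


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the final count tables."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=5):
    """Most frequent words per topic, ignoring words with zero count."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


docs = [tokenize(text) for text in descriptions]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words(topic_word)):
    print(f"topic {k}: {', '.join(words)}")
```

In this sketch, the word lists returned by `top_words` play the role of the recurrent topics described above, and each row of `doc_topic` indicates how strongly a description is associated with each topic, which is the information one would compare against the existing store categories.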
