Mining App Stores: Extracting Technical, Business and Customer Rating Information for Analysis and Prediction

This paper formulates app store analysis as an instance of software repository mining. We use data mining to extract feature information, together with more readily available price and popularity information, to support analysis that combines technical, business and customer-facing app store properties. We applied our approach to 32,108 non-zero priced apps available from the Blackberry app store. Our results show a strong correlation between customer rating and the rank of app downloads, though, perhaps surprisingly, there is no correlation between price and downloads, nor between price and rating. We provide empirical evidence that our extracted features are meaningful and valuable: they maintain the correlations observed at the app level and provide the input to a price prediction system that we construct using Case Based Reasoning. Our prediction system statistically significantly outperforms recommended existing approaches to price estimation (with at least medium effect size) in 16 of the 17 Blackberry App Store categories.
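
To make the idea of price prediction by Case Based Reasoning more concrete, the following is a minimal sketch rather than the system described in the paper. It assumes each app is represented as a set of extracted feature names, uses Jaccard similarity to find the closest past cases, and averages the prices of the k nearest ones. The similarity measure, the value of k, the function names and the toy case base are all illustrative assumptions.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity between two feature sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def predict_price(target_features, case_base, k=2):
    """Estimate a price as the mean price of the k most similar past cases.

    case_base is a list of (feature_set, price) pairs; this layout is a
    hypothetical choice for illustration.
    """
    ranked = sorted(
        case_base,
        key=lambda case: jaccard(target_features, case[0]),
        reverse=True,
    )
    return mean(price for _, price in ranked[:k])


# Toy case base with invented features and prices (not data from the paper).
case_base = [
    (frozenset({"gps", "maps", "offline"}), 4.99),
    (frozenset({"gps", "maps"}), 2.99),
    (frozenset({"chat", "photos"}), 0.99),
    (frozenset({"chat", "video"}), 1.99),
]

estimate = predict_price(frozenset({"gps", "maps", "traffic"}), case_base, k=2)
print(round(estimate, 2))
```

In this toy run the two navigation apps are the nearest analogues, so the estimate is the mean of their prices. A real system would need a principled choice of similarity measure and k, and an evaluation against baseline estimators.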
