Caspar: Extracting and Synthesizing User Stories of Problems from App Reviews

A user's review of an app often describes the user's interactions with the app. These interactions, which we interpret as mini stories, are prominent in reviews with negative ratings. In general, a story in an app review contains at least two types of events: user actions and associated app behaviors. Being able to identify such stories would help an app's developer better maintain and improve the app's functionality and enhance the user experience. We present Caspar, a method for extracting and synthesizing user-reported mini stories regarding app problems from reviews. By extending and applying natural language processing techniques, Caspar extracts ordered events from app reviews, classifies them as user actions or app problems, and synthesizes action-problem pairs. Our evaluation shows that Caspar is effective in finding action-problem pairs in reviews. First, Caspar classifies the events with an accuracy of 82.0% on manually labeled data. Second, relative to human evaluators, Caspar extracts event pairs with 92.9% precision and 34.2% recall. In addition, we train an inference model on the extracted action-problem pairs that automatically predicts possible app problems for different use cases. Preliminary evaluation shows that our method yields promising results. Caspar illustrates the potential for a deeper understanding of app reviews and possibly other natural language artifacts arising in software engineering.
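To make the idea of an action-problem pair concrete, the following is a minimal sketch in Python, not Caspar's actual implementation. It assumes the events of a review have already been extracted in order and labeled as "action" or "problem". The pairing rule (each problem is attached to the nearest preceding user action) is an illustrative assumption, as are the names `Event`, `synthesize_pairs`, and `precision_recall`. The second function shows how precision and recall against human-identified pairs could be computed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """A hypothetical event extracted from a review, in textual order."""
    text: str
    label: str  # assumed to be either "action" or "problem"


def synthesize_pairs(events: List[Event]) -> List[Tuple[str, str]]:
    """Pair each app problem with the nearest preceding user action.

    This pairing rule is an assumption made for illustration only.
    """
    pairs = []
    last_action = None
    for event in events:
        if event.label == "action":
            last_action = event.text
        elif event.label == "problem" and last_action is not None:
            pairs.append((last_action, event.text))
    return pairs


def precision_recall(predicted, gold) -> Tuple[float, float]:
    """Precision and recall of predicted pairs against human-identified pairs."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Invented example review, already split into labeled events.
review_events = [
    Event("I tapped the upload button", "action"),
    Event("the app crashed", "problem"),
    Event("I reopened the app", "action"),
    Event("my photos were gone", "problem"),
]
pairs = synthesize_pairs(review_events)
```

A pipeline of this shape that emits few but reliable pairs has high precision and low recall, which is the pattern the reported 92.9% precision and 34.2% recall describe.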
