Evaluating Software User Feedback Classifiers on Unseen Apps, Datasets, and Metadata

Listening to users' requirements is crucial to building and maintaining high-quality software. Online software user feedback has been shown to contain large amounts of information useful for requirements engineering (RE). Previous studies have created machine learning classifiers to parse this feedback for development insight. While these classifiers report generally good performance when evaluated on a test set, questions remain as to how well they extend to various forms of unseen data. This study evaluates the performance of machine learning classifiers on user feedback for two common classification tasks: classifying bug reports and feature requests. Using seven datasets from prior research studies, we investigate classifier performance when evaluated on feedback from apps not contained in the training set, and when evaluated on entirely different datasets (collected from different feedback platforms and/or labelled by different researchers). We also measure the difference in performance when platform-specific metadata is used as a classification feature. We demonstrate that, in the majority of cases tested, classification performance on feedback from unseen apps is similar to that on seen apps. However, the classifiers do not perform well on unseen datasets. We show that multi-dataset training and zero-shot classification approaches can partially mitigate this performance decrease. Finally, we find that using metadata as a feature when classifying bug reports and feature requests does not lead to a statistically significant improvement in the majority of datasets tested. We discuss the implications of these results for developing user feedback classification models that analyse and extract software requirements.

Peter Devine · Kelly Blincoe
Human Aspects of Software Engineering Lab, University of Auckland, New Zealand
E-mail: pdev438@aucklanduni.ac.nz; k.blincoe@auckland.ac.nz

Yun Sing Koh
School of Computer Science, University of Auckland
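The cross-dataset setup described above can be illustrated with a short sketch. The following is a minimal, hypothetical example, not the paper's actual pipeline: it trains a simple bag-of-words classifier on one labelled feedback dataset and evaluates it on feedback from a different platform, optionally appending a platform-specific metadata column. The file names and the columns `text`, `rating`, and `is_bug` are assumptions for illustration; which metadata is available differs by platform.

```python
# Hypothetical cross-dataset evaluation: train on labelled feedback from one
# platform, test on a dataset from a different platform / different labellers.
# File names and the columns 'text', 'rating', 'is_bug' are assumptions.
import pandas as pd
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

train = pd.read_csv("app_store_reviews.csv")   # seen dataset (hypothetical)
test = pd.read_csv("twitter_feedback.csv")     # unseen dataset (hypothetical)

# Bag-of-words text features, fitted only on the training dataset.
vec = TfidfVectorizer(max_features=20000)
X_train = vec.fit_transform(train["text"])
X_test = vec.transform(test["text"])

# Optionally append platform-specific metadata (e.g. a star rating) as an
# extra feature column to measure its effect on classification performance.
X_train = hstack([X_train, csr_matrix(train[["rating"]].values)])
X_test = hstack([X_test, csr_matrix(test[["rating"]].values)])

clf = LogisticRegression(max_iter=1000).fit(X_train, train["is_bug"])
print("cross-dataset F1:", f1_score(test["is_bug"], clf.predict(X_test)))
```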

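The zero-shot mitigation mentioned in the abstract can be framed with an entailment-based model, where each candidate label is tested as a hypothesis against the feedback text. A minimal sketch using the Hugging Face `transformers` zero-shot pipeline follows; the model name and candidate labels are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal zero-shot classification sketch using an NLI-based model via the
# Hugging Face pipeline API. The model name and candidate labels are
# illustrative assumptions, not the paper's exact configuration.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

review = "The app crashes every time I try to upload a photo."
labels = ["bug report", "feature request", "other"]

result = classifier(review, candidate_labels=labels)
print(result["labels"][0], round(result["scores"][0], 3))  # top label, score
```

Because the NLI model needs no task-specific training data, this kind of classifier can be applied to a new feedback dataset without any labelled examples from it, which is why it is a candidate for handling unseen datasets.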