Towards Effective Test Report Classification to Assist Crowdsourced Testing

Context: Automatic classification of crowdsourced test reports is important because of their tremendous volume and the large proportion of noise they contain. Most existing approaches to this problem focus on examining the performance of different machine learning or information retrieval techniques, and most are evaluated on open-source datasets. However, our observations reveal that these approaches yield poor and unstable performance on real industrial crowdsourced testing data. We further analyze the underlying cause and find that industrial data exhibit significant local bias, which degrades existing approaches. Goal: We aim to design an approach that overcomes the local bias in industrial data and automatically classifies true faults among the large volume of crowdsourced reports. Method: We propose a cluster-based classification approach, which first clusters similar reports together and then builds classifiers on the most similar clusters using an ensemble method. Results: The evaluation is conducted on 15,095 test reports from 35 industrial projects on the largest crowdsourced testing platform in China, and the results are promising, with 0.89 precision and 0.97 recall on average. In addition, our approach improves on existing baselines by 17%-63% in average precision and 15%-61% in average recall. Conclusions: The results imply that our approach can effectively discriminate true faults from large numbers of crowdsourced reports, which can reduce the effort required for manual inspection and facilitate project management in crowdsourced testing. To the best of our knowledge, this is the first work to address the test report classification problem in real industrial crowdsourced testing practice.
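The cluster-based ensemble described in the Method can be illustrated with a minimal sketch. This is not the paper's implementation: the TF-IDF features, KMeans clustering, Naive Bayes base learners, the top_k parameter, and majority voting below are assumptions made here only to show the overall shape of "cluster first, then classify with the most similar clusters".

```python
# Minimal sketch of cluster-based ensemble classification of test reports.
# Assumptions (not from the paper): TF-IDF features, KMeans clustering,
# Naive Bayes base classifiers, and majority voting over the top-k clusters.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics.pairwise import cosine_similarity


def train_cluster_ensemble(reports, labels, n_clusters=10):
    """Cluster historical reports and train one classifier per cluster."""
    vectorizer = TfidfVectorizer(max_features=2000)
    X = vectorizer.fit_transform(reports)
    km = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit(X)
    classifiers = {}
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        cluster_labels = [labels[i] for i in idx]
        if len(set(cluster_labels)) < 2:
            continue  # skip clusters that contain only one class
        classifiers[c] = MultinomialNB().fit(X[idx], cluster_labels)
    return vectorizer, km, classifiers


def classify_report(report, vectorizer, km, classifiers, top_k=3):
    """Vote among the classifiers of the top-k most similar clusters."""
    x = vectorizer.transform([report])
    sims = cosine_similarity(x, km.cluster_centers_)[0]
    ranked = [c for c in np.argsort(sims)[::-1] if c in classifiers][:top_k]
    votes = [classifiers[c].predict(x)[0] for c in ranked]
    return max(set(votes), key=votes.count)  # majority vote: 1 = true fault
```

The intent of such a design, as motivated in the abstract, is to mitigate local bias: each base classifier is trained only on reports that resemble the one being classified, and the ensemble vote over the most similar clusters smooths out instability from any single local model.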
