Where Is the Road for Issue Reports Classification Based on Text Mining?

Currently, open source projects receive various kinds of issues daily because of the extreme openness of the Issue Tracking System (ITS) in GitHub. Categorizing these issues is a labor-intensive and time-consuming task for project managers. However, a contributor is only required to provide a short textual abstract to report an issue in GitHub. Thus, most traditional classification approaches based on detailed and structured data (e.g., priority, severity, software version and so on) are difficult to adopt. In this paper, issue classification approaches were investigated on a large-scale dataset, including 80 popular projects and over 252,000 issue reports collected from GitHub. First, four traditional text-based classification methods and their performances were discussed. Second, quantitative and qualitative study showed that semantic perplexity (i.e., an issue's description confuses bug-related sentences with nonbug-related sentences) is a crucial factor that affects classification performance. Finally, a two-stage classifier framework based on the novel metrics of semantic perplexity of issue reports was designed. Results show that our two-stage classification can significantly improve issue classification performance.
