A Case Study of Bias in Bug-Fix Datasets

Software quality researchers build software quality models by recovering traceability links between bug reports in issue tracking repositories and source code files. However, all too often the data stored in issue tracking repositories is not explicitly tagged or linked to source code. Researchers have to resort to heuristics to tag the data (e.g., to determine whether an issue is a bug report or a work item) or to link a piece of code to a particular issue or bug. Recent studies by Bird et al. and by Antoniol et al. suggest that models built on imperfect datasets, with missing links to the code and incorrectly tagged issues, exhibit biases that compromise the validity and generality of the resulting quality models. In this study, we verify the effects of such biases for a commercial project that enforces strict development guidelines and rules on the quality of the data in its issue tracking repository. Our results show that even in such a controlled setting, with a near-ideal dataset, biases do exist, leading us to conjecture that biases are more likely a symptom of the underlying software development process than an artifact of the heuristics used.
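The linking heuristics the abstract alludes to are commonly implemented by scanning commit messages for issue-tracker identifiers and treating a match as a traceability link. The following is a minimal sketch of that idea, not the paper's actual tooling; the regular expression, the function name, and the example message are illustrative assumptions.

```python
import re

# Minimal sketch (illustrative, not the paper's tooling): link a commit to
# issue-tracker entries by scanning its message for bug/issue identifiers
# such as "bug 4711", "fixes #42", or "issue: 7".
BUG_ID_PATTERN = re.compile(
    r"(?:bug|issue|fix(?:es|ed)?)\s*[:#]?\s*(\d+)|#(\d+)",
    re.IGNORECASE,
)

def link_commit_to_issues(commit_message: str) -> set:
    """Return the set of issue IDs heuristically referenced in a commit message."""
    ids = set()
    for match in BUG_ID_PATTERN.finditer(commit_message):
        ids.add(match.group(1) or match.group(2))
    return ids

if __name__ == "__main__":
    message = "Fixed bug 4711 and issue #42: add null check in the parser"
    print(link_commit_to_issues(message))  # e.g. {'4711', '42'}
```

Heuristics of this kind recover only the links that developers happened to record in commit messages, which is precisely the source of the sampling bias that Bird et al. [5] report and that this study re-examines.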

[1] Ahmed E. Hassan, et al. Predicting faults using the complexity of code changes, 2009, IEEE 31st International Conference on Software Engineering (ICSE 2009).

[2] Harald C. Gall, et al. Populating a Release History Database from version control and bug tracking systems, 2003, International Conference on Software Maintenance (ICSM 2003).

[3] Laurie A. Williams, et al. Early estimation of software quality using in-process testing metrics, 2005, WoSQ@ICSE.

[4] A. Zeller, et al. Predicting Defects for Eclipse, 2007, Third International Workshop on Predictor Models in Software Engineering (PROMISE'07).

[5] Premkumar T. Devanbu, et al. Fair and balanced?: bias in bug-fix datasets, 2009, ESEC/FSE '09.

[6] A. Strauss, et al. Basics of qualitative research: Grounded theory procedures and techniques, 1993.

[7] Gerard E. Dallal, et al. Lies, Damn Lies, and Statistics: The Manipulation of Public Opinion in America, 1976.

[8] Bhekisipho Twala, et al. Filtering, Robust Filtering, Polishing: Techniques for Addressing Quality in Software Data, 2007, First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007).

[9] Andreas Zeller, et al. Predicting faults from cached history, 2008, ISEC '08.

[10] Audris Mockus, et al. Missing Data in Software Engineering, 2008, Guide to Advanced Empirical Software Engineering.

[11] Harald C. Gall, et al. Cross-project defect prediction: a large scale experiment on data vs. domain vs. process, 2009, ESEC/FSE '09.

[12] Foutse Khomh, et al. Is it a bug or an enhancement?: a text-based approach to classify change requests, 2008, CASCON '08.

[13] V. Malheiros, et al. A Visual Text Mining approach for Systematic Reviews, 2007, ESEM 2007.

[14] R. Fisher. On the Interpretation of χ² from Contingency Tables, and the Calculation of P, 1922, Journal of the Royal Statistical Society.

[15] Martin Shepperd, et al. Data Sets and Data Quality in Software Engineering: Eight Years On, 2016, PROMISE.

[16] Nachiappan Nagappan, et al. Using Software Dependencies and Churn Metrics to Predict Field Failures: An Empirical Case Study, 2007, First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007).

[17] Janice Singer, et al. Guide to Advanced Empirical Software Engineering, 2007.

[18] J. Herbsleb, et al. Two case studies of open source software development: Apache and Mozilla, 2002, ACM Transactions on Software Engineering and Methodology (TOSEM).
