Fair and balanced?: bias in bug-fix datasets

Software engineering researchers have long been interested in where and why bugs occur in code, and in predicting where they might turn up next. Historical bug-occurrence data has been key to this research. Bug-tracking systems and code version histories record when, how, and by whom bugs were fixed; from these sources, datasets that relate file changes to bug fixes can be extracted. These historical datasets can be used to test hypotheses concerning processes of bug introduction, and also to build statistical bug-prediction models. Unfortunately, processes and humans are imperfect: only a fraction of bug fixes are actually labelled as such in source-code version histories, and thus become available for study in the extracted datasets. The question naturally arises: are the bug fixes recorded in these historical datasets a fair representation of the full population of bug fixes? In this paper, we investigate historical data from several software projects and find strong evidence of systematic bias. We then investigate the potential effects of such "unfair, imbalanced" datasets on the performance of prediction techniques. We draw the lesson that bias is a critical problem, threatening both the effectiveness of processes that rely on biased datasets to build prediction models and the generalizability of hypotheses tested on biased data.
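To make the extraction step concrete, below is a minimal sketch of the message-matching heuristic commonly used to link version-history changes to bug reports. The regular expressions, the `linked_fix_commits` helper, and the `git log` invocation are illustrative assumptions, not the paper's exact procedure:

```python
import re
import subprocess

# Illustrative patterns for spotting bug references in commit messages;
# real studies tune these per project, so treat them as assumptions.
BUG_ID_PATTERNS = [
    re.compile(r"\bbugs?\s*#?\s*(\d+)", re.IGNORECASE),
    re.compile(r"\bfix(?:es|ed)?\s*#?\s*(\d+)", re.IGNORECASE),
]

def linked_fix_commits(repo_path):
    """Yield (commit_hash, bug_id) for commits whose log message mentions
    a bug identifier. Fixes committed *without* such a mention never enter
    the extracted dataset -- exactly the linkage gap that can introduce bias."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%H%x00%s %b%x01"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Entries are delimited by \x01; hash and message by \x00.
    for entry in filter(None, (e.strip() for e in log.split("\x01"))):
        commit_hash, _, message = entry.partition("\x00")
        for pattern in BUG_ID_PATTERNS:
            match = pattern.search(message)
            if match:
                yield commit_hash, match.group(1)
                break
```

The bias question is then whether the commits this heuristic captures are a fair sample of all fix commits, since any systematic difference in who mentions bug identifiers, and when, skews the resulting dataset.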
