Revisiting supervised and unsupervised models for effort-aware just-in-time defect prediction

Effort-aware just-in-time (JIT) defect prediction aims to find more defective software changes with limited code inspection cost. Traditionally, supervised models have been used; however, they require sufficient labelled training data, which is difficult to obtain, especially for new projects. Recently, Yang et al. proposed an unsupervised model (i.e., LT) and applied it to projects with rich historical bug data. Interestingly, they reported that, under the same inspection cost (i.e., 20 percent of the total lines of code modified by all changes), LT could find about 12%-27% more defective changes than a state-of-the-art supervised model (i.e., EALR) under different evaluation settings. This is surprising, as supervised models that benefit from historical data are expected to perform better than unsupervised ones. Their finding suggests that previous studies on defect prediction may have made a simple problem too complex. Considering the potential high impact of Yang et al.'s work, we perform a replication study in this paper and present the following new findings: (1) Under the same inspection budget, LT requires developers to inspect a large number of changes, which necessitates many more context switches. (2) Although LT finds more defective changes, many of its highly ranked changes are false alarms, and these initial false alarms may erode practitioners' patience and confidence. (3) LT does not outperform EALR when the harmonic mean of recall and precision (i.e., the F1-score) is considered. Beyond highlighting these findings, we propose a simple but improved supervised model called CBS+, which leverages the ideas of both EALR and LT. We investigate the performance of CBS+ under three evaluation settings: time-wise cross-validation, 10-times 10-fold cross-validation, and cross-project validation. Compared with EALR, CBS+ detects about 15%-26% more defective changes while keeping the numbers of context switches and initial false alarms close to those of EALR. Compared with LT, CBS+ detects a comparable number of defective changes while significantly reducing the context switches and initial false alarms before the first success. Finally, we discuss how to balance the trade-off between the number of inspected defects and the number of context switches, and we present the implications of our findings for practitioners and researchers.
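To make the ranking and evaluation protocol described above concrete, the sketch below shows one way a CBS+-style model and the 20%-effort evaluation could be implemented; the Change fields, the logistic-regression classifier, the ascending-churn ordering, and the metric names are illustrative assumptions rather than the paper's exact recipe.

```python
# A minimal sketch (not the paper's exact implementation): a CBS+-style
# pipeline that combines the ideas of EALR and LT as described in the
# abstract -- train a supervised classifier on labelled changes, then
# inspect the changes it flags as defect-prone from smallest to largest --
# and an evaluation under a budget of 20% of the total modified lines of code.
# The Change fields, the logistic-regression classifier, and the
# ascending-churn ordering are illustrative assumptions.

from dataclasses import dataclass
from typing import List

import numpy as np
from sklearn.linear_model import LogisticRegression


@dataclass
class Change:
    features: np.ndarray  # change-level metrics (e.g., churn, #files, entropy)
    churn: int            # lines added + deleted; used as inspection effort
    defective: bool       # ground-truth label (known only for training data here)


def rank_cbs_plus(train: List[Change], test: List[Change]):
    """Return test changes paired with predicted labels, in inspection order:
    predicted-defective changes first, smaller changes before larger ones."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.vstack([c.features for c in train]), [c.defective for c in train])
    flagged = clf.predict(np.vstack([c.features for c in test]))
    return sorted(zip(test, flagged), key=lambda t: (not t[1], t[0].churn))


def evaluate_at_effort(ranked, budget: float = 0.2):
    """Walk the ranking until `budget` of the total modified LOC is spent."""
    changes = [c for c, _ in ranked]
    total_effort = sum(c.churn for c in changes)
    total_defective = sum(c.defective for c in changes)
    spent, inspected, hits, ifa, seen_hit = 0, 0, 0, 0, False
    for c in changes:
        if spent + c.churn > budget * total_effort:
            break
        spent += c.churn
        inspected += 1
        if c.defective:
            hits += 1
            seen_hit = True
        elif not seen_hit:
            ifa += 1  # non-defective changes inspected before the first defective one
    precision = hits / max(inspected, 1)
    recall = hits / max(total_defective, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {
        "recall@20%": recall,
        "precision@20%": precision,
        "F1@20%": f1,                    # harmonic mean of precision and recall
        "changes_inspected": inspected,  # rough proxy for context switches
        "IFA": ifa,                      # initial false alarms before first success
    }
```

Under this protocol, recall at the 20% budget reflects how many defective changes are found, the number of inspected changes serves as a rough proxy for context switches, and IFA counts the non-defective changes ranked before the first true positive.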

[1] Thomas Fritz, et al. Software developers' perceptions of productivity, 2014, SIGSOFT FSE.

[2] Akito Monden, et al. An analysis of developer metrics for fault prediction, 2010, PROMISE '10.

[3] Tim Menzies, et al. "Better Data" is Better than "Better Data Miners" (Benefits of Tuning SMOTE for Defect Prediction), 2017, ICSE.

[4] Uirá Kulesza, et al. The impact of refactoring changes on the SZZ algorithm: An empirical study, 2018, 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER).

[5] Mary Shaw, et al. Experiences and results from initiating field defect prediction and product test prioritization efforts at ABB Inc., 2006, ICSE.

[6] Yuming Zhou, et al. Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models, 2016, SIGSOFT FSE.

[7] Harald C. Gall, et al. Cross-project defect prediction: a large scale experiment on data vs. domain vs. process, 2009, ESEC/SIGSOFT FSE.

[8] Audris Mockus, et al. A large-scale empirical study of just-in-time quality assurance, 2013, IEEE Transactions on Software Engineering.

[9] Thomas Zimmermann, et al. Automatic Identification of Bug-Introducing Changes, 2006, 21st IEEE/ACM International Conference on Automated Software Engineering (ASE'06).

[10] Ken-ichi Matsumoto, et al. Studying re-opened bugs in open source software, 2012, Empirical Software Engineering.

[11] F. Wilcoxon. Individual Comparisons by Ranking Methods, 1945.

[12] Hongfang Liu, et al. An Investigation into the Functional Form of the Size-Defect Relationship for Software Modules, 2009, IEEE Transactions on Software Engineering.

[13] Elliot Soloway, et al. Where the bugs are, 1985, CHI '85.

[14] Ahmed E. Hassan, et al. An industrial study on the risk of software changes, 2012, SIGSOFT FSE.

[15] David Lo, et al. Collective Personalized Change Classification With Multiobjective Search, 2016, IEEE Transactions on Reliability.

[16] Hongfang Liu, et al. Testing the theory of relative defect proneness for closed-source software, 2010, Empirical Software Engineering.

[17] Yi Zhang, et al. Classifying Software Changes: Clean or Buggy?, 2008, IEEE Transactions on Software Engineering.

[18] K. Goseva-Popstojanova, et al. Common Trends in Software Fault and Failure Data, 2009, IEEE Transactions on Software Engineering.

[19] Premkumar T. Devanbu, et al. Recalling the "imprecision" of cross-project defect prediction, 2012, SIGSOFT FSE.

[20] Shane McIntosh, et al. Automated Parameter Optimization of Classification Techniques for Defect Prediction Models, 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[21] Pornsiri Muenchaisri, et al. Predicting Faulty Classes Using Design Metrics with Discriminant Analysis, 2003, Software Engineering Research and Practice.

[22] Tracy Hall, et al. A Systematic Literature Review on Fault Prediction Performance in Software Engineering, 2012, IEEE Transactions on Software Engineering.

[23] Xinli Yang, et al. Deep Learning for Just-in-Time Defect Prediction, 2015, 2015 IEEE International Conference on Software Quality, Reliability and Security.

[24] N. Cliff. Ordinal methods for behavioral data analysis, 1996.

[25] Alessandro Orso, et al. Are automated debugging techniques actually helping programmers?, 2011, ISSTA '11.

[26] Jaechang Nam, et al. CLAMI: Defect Prediction on Unlabeled Datasets (T), 2015, 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[27] Taghi M. Khoshgoftaar, et al. The Detection of Fault-Prone Programs, 1992, IEEE Trans. Software Eng.

[28] Andreas Zeller, et al. When do changes induce fixes?, 2005, ACM SIGSOFT Softw. Eng. Notes.

[29] Ahmed E. Hassan, et al. Predicting faults using the complexity of code changes, 2009, 2009 IEEE 31st International Conference on Software Engineering.

[30] Tibor Gyimóthy, et al. Empirical validation of object-oriented metrics on open source software for fault prediction, 2005, IEEE Transactions on Software Engineering.

[31] Rainer Koschke, et al. Effort-Aware Defect Prediction Models, 2010, 2010 14th European Conference on Software Maintenance and Reengineering.

[32] Lingfeng Bao, et al. "Automated Debugging Considered Harmful" Considered Harmful: A User Study Revisiting the Usefulness of Spectra-Based Fault Localization Techniques with Professionals Using Real Bugs from Large Systems, 2016, 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[33] Andreas Zeller, et al. Mining metrics to predict component failures, 2006, ICSE.

[34] David Lo, et al. Identifying self-admitted technical debt in open source projects using text mining, 2017, Empirical Software Engineering.

[35] N. Nagappan, et al. Use of relative code churn measures to predict system defect density, 2005, Proceedings of the 27th International Conference on Software Engineering (ICSE 2005).

[36] Ayse Basar Bener, et al. Defect prediction from static code features: current results, limitations, new approaches, 2010, Automated Software Engineering.

[37] Lionel C. Briand, et al. A systematic and comprehensive investigation of methods to build and evaluate fault prediction models, 2010, J. Syst. Softw.

[38] Tim Menzies, et al. Revisiting unsupervised learning for defect prediction, 2017, ESEC/SIGSOFT FSE.

[39] Yuming Zhou, et al. How Far We Have Progressed in the Journey? An Examination of Cross-Project Defect Prediction, 2018, ACM Trans. Softw. Eng. Methodol.

[40] Petra Perner, et al. Data Mining - Concepts and Techniques, 2002, Künstliche Intell.

[41] David Lo, et al. Supervised vs Unsupervised Models: A Holistic Look at Effort-Aware Just-in-Time Defect Prediction, 2017, 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[42] Uirá Kulesza, et al. A Framework for Evaluating the Results of the SZZ Approach for Identifying Bug-Introducing Changes, 2017, IEEE Transactions on Software Engineering.

[43] David Lo, et al. Practitioners' expectations on automated fault localization, 2016, ISSTA.

[44] Emad Shihab, et al. Characterizing and predicting blocking bugs in open source projects, 2014, MSR 2014.

[45] Shane McIntosh, et al. Revisiting the Impact of Classification Techniques on the Performance of Defect Prediction Models, 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[46] Brendan Murphy, et al. Using Historical In-Process and Product Metrics for Early Estimation of Software Failures, 2006, 2006 17th International Symposium on Software Reliability Engineering.

[47] Dewayne E. Perry, et al. Toward understanding the rhetoric of small source code changes, 2005, IEEE Transactions on Software Engineering.

[48] Audris Mockus, et al. Predicting risk of software changes, 2000, Bell Labs Technical Journal.

[49] Philip J. Guo, et al. Characterizing and predicting which bugs get fixed: an empirical study of Microsoft Windows, 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[50] J. Hintze, et al. Violin plots: A box plot-density trace synergism, 1998.

[51] Tian Jiang, et al. Personalized defect prediction, 2013, 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[52] Witold Pedrycz, et al. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction, 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[53] Ayse Basar Bener, et al. On the relative value of cross-company and within-company data for defect prediction, 2009, Empirical Software Engineering.

[54] Ian H. Witten, et al. The WEKA data mining software: an update, 2009, SIGKDD Explorations.

[55] Tim Menzies, et al. How good is your blind spot sampling policy, 2004, Eighth IEEE International Symposium on High Assurance Systems Engineering.

[56] Lionel C. Briand, et al. Data Mining Techniques for Building Fault-proneness Models in Telecom Java Software, 2007, The 18th IEEE International Symposium on Software Reliability (ISSRE '07).

[57] Michele Lanza, et al. An extensive comparison of bug prediction approaches, 2010, 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010).

[58] Ding Yuan, et al. How do fixes become bugs?, 2011, ESEC/FSE '11.

[59] David Lo, et al. File-Level Defect Prediction: Unsupervised vs. Supervised Models, 2017, 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM).

[60] David Lo, et al. HYDRA: Massively Compositional Model for Cross-Project Defect Prediction, 2016, IEEE Transactions on Software Engineering.

[61] Premkumar T. Devanbu, et al. How, and why, process metrics are better, 2013, 2013 35th International Conference on Software Engineering (ICSE).

[62] Harvey P. Siy, et al. Predicting Fault Incidence Using Software Change History, 2000, IEEE Trans. Software Eng.

[63] Xinli Yang, et al. TLEL: A two-layer ensemble learning approach for just-in-time defect prediction, 2017, Inf. Softw. Technol.