Accepted or Abandoned? Predicting the Fate of Code Changes

Many mature open-source software (OSS) and commercial organizations have adopted peer code review as an integral part of their development process to ensure product quality. Of particular interest are code changes that end up "abandoned," either because they are rejected or, more commonly, because they are never accepted at all (at least not through the review tool). Factors such as resource allocation, job environment, and an efficiency mismatch between the author and the reviewer can cause a code change to be abandoned even after months of effort from developers and reviewers. Predicting the review outcome of such code changes can ease task prioritization and the use of limited resources by saving the time spent on low-quality code changes. In this paper, we conducted a comprehensive study to predict whether a code change will be merged or abandoned, applying several well-known supervised machine learning algorithms. We propose PredCR, a Random Forest-based model that predicts the review outcome of a code change with a 0.91 F-measure on the test set at the very beginning of the review. It also improves predictions of abandoned changes by 27%-103% and of merged changes by 5%-11%. Our model accurately classifies 93% of the top 25% of code changes (with an average duration of 196 days) that go longest without being merged. Although PredCR achieves very high performance at a very early stage (within the first 10% of the review process), it can also adapt to changes in feature values at later stages of the review, so the prediction quality for a particular code change can improve as the review progresses. We also conducted a study of the properties of an ideal training set for our tool and found that training on instances from the same project yields a 9%-25% performance increase.
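To make the modeling setup concrete, the sketch below shows a minimal PredCR-style experiment in Python: a Random Forest classifier trained on per-change features to predict a merged/abandoned label, evaluated with the F-measure. The synthetic data, feature dimensionality, and hyperparameters here are illustrative assumptions, not the paper's actual feature set or pipeline.

# Minimal sketch of a PredCR-style experiment: a Random Forest classifier
# predicting whether a code change will be merged or abandoned.
# The data below is synthetic and the features are hypothetical stand-ins
# for early-stage review features (e.g., author experience, patch size).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

# Stand-in for real review data: each row is one code change,
# label 1 = merged, label 0 = abandoned.
n_changes = 1000
X = rng.normal(size=(n_changes, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_changes) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate with the F-measure for each class, mirroring how the paper
# reports results for merged and abandoned changes separately.
pred = clf.predict(X_test)
print(f"F1 (merged class):    {f1_score(y_test, pred):.2f}")
print(f"F1 (abandoned class): {f1_score(y_test, pred, pos_label=0):.2f}")

A Random Forest is a natural fit for this kind of tabular review metadata: it handles heterogeneous numeric features without heavy preprocessing and exposes feature importances, which helps explain which early-stage signals drive a merge/abandon prediction.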
