Effect of Technical and Social Factors on Pull Request Quality for the NPM Ecosystem

Background: Pull request (PR) based development, the norm on social coding platforms, entails the challenge of evaluating contributions from often unfamiliar developers across the open source ecosystem and, conversely, of submitting a contribution to a project with unfamiliar maintainers. Previous studies suggest that the decision to accept or reject a PR may be influenced by a diverse set of technical and social factors, but they often focus on relatively few projects and do not consider ecosystem-wide measures or possible non-monotonic relationships between the predictors and PR acceptance probability. Aim: We aim to shed light on this important decision-making process by testing which measures significantly affect the probability of PR acceptance across a substantial fraction of a large ecosystem, ranking them by their relative importance in predicting PR acceptance, and determining the shape of the functions that map each predictor to PR acceptance. Method: We proposed seven hypotheses regarding which technical and social factors might affect PR acceptance and created 17 measures based on them. Our dataset consisted of 470,925 PRs from 3,349 popular NPM packages, created by 79,128 GitHub users. We tested which of the measures affect PR acceptance and ranked the significant measures by their importance in a predictive model. Results: Our predictive model had an AUC of 0.94, and 15 of the 17 measures were found to matter, including five novel ecosystem-wide measures. Measures describing the number of PRs submitted to a repository and the fraction of those that get accepted, along with signals about the PR review phase, were most significant. We also discovered that only four predictors have a linear influence on the PR acceptance probability, while the others showed a more complicated response. Conclusion: Our findings should help PR creators, integrators, and tool designers focus on the important factors affecting PR acceptance.
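
To make the reported AUC concrete, the following is a minimal sketch of how an AUC value can be computed from predicted acceptance probabilities. It is not the authors' code or data: the function name `auc`, the toy labels, and the toy scores are all hypothetical, and the pairwise (rank-based) definition is used here only because it is easy to read. Under this definition, the AUC is the probability that a randomly chosen accepted PR receives a higher score than a randomly chosen rejected PR.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (accepted, rejected) pairs ranked correctly.

    labels: 1 for an accepted PR, 0 for a rejected PR (hypothetical toy data).
    scores: predicted acceptance probabilities from some model.
    Ties between an accepted and a rejected score count as half a win.
    """
    accepted = [s for y, s in zip(labels, scores) if y == 1]
    rejected = [s for y, s in zip(labels, scores) if y == 0]
    if not accepted or not rejected:
        raise ValueError("AUC needs at least one accepted and one rejected PR")

    wins = 0.0
    for a in accepted:
        for r in rejected:
            if a > r:
                wins += 1.0
            elif a == r:
                wins += 0.5
    return wins / (len(accepted) * len(rejected))


# Toy example (made-up values, not taken from the study).
toy_labels = [1, 1, 1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.35, 0.4, 0.7, 0.2, 0.1, 0.6]
toy_auc = auc(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

In this toy data, 14 of the 15 accepted/rejected pairs are ordered correctly, so the sketch yields an AUC of 14/15. The quadratic pairwise loop is chosen for clarity; for a dataset of the size described above, a sort-based computation would be the more practical choice.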
