GAP: Forecasting commit activity in git projects

Abstract: Abandonment by active developers poses a significant risk for many open source software projects. This risk can be reduced by forecasting the future activity of the contributors involved in such projects. Focusing on the commit activity of individuals involved in git repositories, this paper proposes a practicable probabilistic forecasting model based on the statistical technique of survival analysis. The model is empirically validated on a wide variety of projects, accounting for 7528 git repositories and 5947 active contributors. We found that a model based on the last 20 observed days of commit activity per contributor provides the best concordance. We also found that the predictions provided by the model are generally close to actual observations, with slight underestimations for low-probability predictions and slight overestimations for higher-probability predictions. The model is implemented as part of an open source tool, called GAP, that predicts future commit activity.
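To give a feel for the kind of survival-analysis estimate the abstract refers to, the following is a minimal sketch of a Kaplan-Meier estimator applied to the gaps (in days) between a contributor's consecutive commit days. It is an illustration under stated assumptions, not the GAP implementation: the function names `kaplan_meier` and `prob_commit_within`, the sample data, and the choice of a plain Kaplan-Meier curve are all hypothetical.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Kaplan-Meier survival curve as a list of (t, S(t)) at each event time.

    durations: gap lengths in days (hypothetical input format).
    observed:  True if the gap ended with a commit, False if it is
               censored (still open when the observation period ended).
    """
    events = Counter(d for d, o in zip(durations, observed) if o)
    exits = Counter(durations)  # events and censored gaps leaving the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def prob_commit_within(curve, days):
    """Estimated probability that the next commit occurs within `days` days."""
    s = 1.0
    for t, value in curve:
        if t <= days:
            s = value
    return 1 - s


# Hypothetical gaps between commit days; the last gap in each censored
# pair is still open at the end of the observation period.
gaps = [1, 2, 2, 3, 5]
ended_with_commit = [True, True, False, True, False]

curve = kaplan_meier(gaps, ended_with_commit)
print(curve)
print(prob_commit_within(curve, 2))
```

Here S(t) is the estimated probability that a gap lasts longer than t days, so 1 - S(t) can be read as the probability of seeing a commit within t days. Restricting the input to a recent window of activity, such as the last 20 observed days mentioned in the abstract, would be a matter of filtering `gaps` before calling the estimator.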
