Identifying and characterizing unmaintained projects in GitHub

Open source projects are key components of modern software development. With the emergence of platforms for developing public code (e.g., GitHub and GitLab), developers have created thousands of open source projects. As a consequence, a significant number of these projects are also unmaintained. To tackle this problem, in this thesis we report a set of quantitative and qualitative studies to help developers maintain their open source projects. First, we surveyed the owners of open source projects that are no longer actively maintained, aiming to reveal the reasons for stopping the maintenance of their projects. As a result, we provide a set of nine reasons that motivated them to abandon their projects. Second, we conducted a survey with developers who recently became core contributors of popular GitHub projects. We reveal their motivations to contribute to these projects, the project characteristics that most helped them to contribute, and the barriers they faced. Our key results show that the surveyed developers contributed to the projects because they were using them and needed some improvements. The participants also answered that the project leaders' lack of time was the principal barrier they faced, and that the project characteristic that most helped them to contribute was the existence of a friendly community. Third, we propose a quantitative, data-driven model to identify GitHub projects that are not actively maintained. We train the model using a set of 13 features about project activity (e.g., commits, forks, and issues). The model achieved a precision of 80%, based on the feedback of 129 real open source developers, and a recall of 96%. We also showed that the model can be used to identify unmaintained projects early, without having to wait for one year of inactivity, as commonly proposed in the literature.
Finally, we defined a metric, called Level of Maintenance Activity (LMA), to assess the risk of projects becoming unmaintained. We provided evidence on the applicability of this metric by investigating its usage in 2,927 active projects.
