A Large Scale Study of Long-Time Contributor Prediction for GitHub Projects

The continuous contributions made by long-time contributors (LTCs) are a key factor that enables open source software (OSS) projects to succeed and survive. We study GitHub because it hosts a large number of OSS projects and millions of contributors, which enables the study of the transition from newcomer to LTC. In this paper, we investigate whether we can effectively predict which newcomers in OSS projects will become LTCs based on their activity data collected from GitHub. We collect GitHub data from GHTorrent, a mirror of GitHub data, and select the 917 most popular projects, which contain 75,046 contributors. We consider a developer to be an LTC of a project if the time interval between his/her first and last commit in the project is longer than a certain threshold T. In our experiments, we use three settings for T: 1, 2, and 3 years, under which 9,238, 3,968, and 1,577 contributors, respectively, become LTCs of a project. To build a prediction model, we extract many features from the activities of developers on GitHub, which we group into five dimensions: developer profile, repository profile, developer monthly activity, repository monthly activity, and collaboration network. We apply several classifiers, including naive Bayes, SVM, decision tree, kNN, and random forest. We find that the random forest classifier achieves the best performance, with AUCs of more than 0.75 in all three settings of T. We also investigate the most important features that differentiate newcomers who become LTCs from newcomers who stay in a project for only a short time. We find that the number of followers is the most important feature in all three settings studied. We also find that the programming language and the average number of commits contributed by other developers when a newcomer joins a project are among the top 10 most important features in all three settings. Finally, we provide several actionable implications based on our analysis results to help OSS projects retain newcomers.
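To make the pipeline concrete, the sketch below shows how such a labeling and prediction setup could be assembled with pandas and scikit-learn. This is a minimal sketch under stated assumptions, not the paper's implementation: the input file contributor_activity.csv, its column names (first_commit, last_commit, and numeric feature columns), and the 1-year threshold are hypothetical stand-ins for the GHTorrent-derived data described in the abstract.

```python
# Minimal sketch of LTC labeling, random forest training, AUC evaluation,
# and feature ranking, assuming activity data has already been extracted
# from a GHTorrent dump into a CSV. All column names are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

THRESHOLD_YEARS = 1  # the paper also evaluates 2- and 3-year thresholds


def label_ltcs(df: pd.DataFrame) -> pd.Series:
    """Label a contributor as an LTC of a project if the interval between
    their first and last commit exceeds the threshold T."""
    interval = df["last_commit"] - df["first_commit"]
    return (interval > pd.Timedelta(days=365 * THRESHOLD_YEARS)).astype(int)


df = pd.read_csv(
    "contributor_activity.csv", parse_dates=["first_commit", "last_commit"]
)
df["ltc"] = label_ltcs(df)

# The features would span the paper's five dimensions: developer profile,
# repository profile, developer monthly activity, repository monthly
# activity, and collaboration network. Here we assume all remaining
# columns are numeric feature columns.
feature_cols = [c for c in df.columns if c not in ("ltc", "first_commit", "last_commit")]
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["ltc"], test_size=0.2, stratify=df["ltc"], random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# AUC is the paper's evaluation metric; its random forest exceeded 0.75.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.3f}")

# Rank features by importance, mirroring the paper's analysis of which
# features (e.g., follower count) best separate future LTCs from others.
importances = pd.Series(clf.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False).head(10))
```

Random forest is a natural starting point here because it handles heterogeneous feature scales without normalization and exposes per-feature importances, which supports the kind of feature analysis the study reports.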
