Findings from GitHub: Methods, Datasets and Limitations

GitHub, one of the most popular social coding platforms, is the platform of reference when mining Open Source repositories to learn from past experiences. In recent years, a number of research papers have reported findings based on data mined from GitHub. As the community continues to deepen its understanding of software engineering through analyses performed on this platform, we believe it is worthwhile to reflect on how research papers have approached the task of mining GitHub repositories. To this end, we present a meta-analysis of 93 research papers that addresses three main dimensions of those papers: i) the empirical methods employed, ii) the datasets used and iii) the limitations reported. The results of our meta-analysis raise concerns about the dataset collection process and dataset size, the low level of replicability, poor sampling techniques, the lack of longitudinal studies and the limited variety of methodologies.
