The promises and perils of mining GitHub

With over 10 million git repositories, GitHub is becoming one of the most important source of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of the repositories in GitHub and how users take advantage of GitHub's main features---namely commits, pull requests, and issues. Our results indicate that, while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. We show, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were. We provide a set of recommendations for software engineering researchers on how to approach the data in GitHub.

[1]  Premkumar T. Devanbu,et al.  Latent social structure in open source projects , 2008, SIGSOFT '08/FSE-16.

[2]  Daniel M. Germán,et al.  The promises and perils of mining git , 2009, 2009 6th IEEE International Working Conference on Mining Software Repositories.

[3]  Janice Singer,et al.  Hipikat: a project memory for software development , 2005, IEEE Transactions on Software Engineering.

[4]  Jan Bosch,et al.  Social Networking Meets Software Development: Perspectives from GitHub, MSDN, Stack Exchange, and TopCoder , 2013, IEEE Software.

[5]  T. Zimmermann,et al.  Predicting Faults from Cached History , 2007, 29th International Conference on Software Engineering (ICSE'07).

[6]  Georgios Gousios,et al.  GHTorrent: Github's data from a firehose , 2012, 2012 9th IEEE Working Conference on Mining Software Repositories (MSR).

[7]  James D. Herbsleb,et al.  Social coding in GitHub: transparency and collaboration in an open software repository , 2012, CSCW.

[8]  Kevin Crowston,et al.  The Perils and Pitfalls of Mining SourceForge , 2004, MSR.

[9]  Thomas Zimmermann,et al.  Preprocessing CVS Data for Fine-Grained Analysis , 2004, MSR.

[10]  Michael D. Ernst,et al.  Prioritizing Warning Categories by Analyzing Software History , 2007, Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007).

[11]  Thomas Zimmermann,et al.  Automatic Identification of Bug-Introducing Changes , 2006, 21st IEEE/ACM International Conference on Automated Software Engineering (ASE'06).

[12]  Premkumar T. Devanbu,et al.  Detecting Patch Submission and Acceptance in OSS Projects , 2007, Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007).

[13]  Premkumar T. Devanbu,et al.  The missing links: bugs and bug-fix commits , 2010, FSE '10.

[14]  James D. Herbsleb,et al.  Impression formation in online peer production: activity traces and personal profiles in github , 2013, CSCW.

[15]  Yuri Takhteyev,et al.  Investigating the Geography of Open Source Software through Github , 2010 .

[16]  Premkumar T. Devanbu,et al.  Fair and balanced?: bias in bug-fix datasets , 2009, ESEC/FSE '09.

[17]  Harald C. Gall,et al.  Software evolution: analysis and visualization , 2006, ICSE '06.

[18]  Andreas Zeller,et al.  Mining version histories to guide software changes , 2005, Proceedings. 26th International Conference on Software Engineering.

[19]  Georgios Gousios,et al.  The GHTorent dataset and tool suite , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[20]  David Lo,et al.  Network Structure of Social Coding in GitHub , 2013, 2013 17th European Conference on Software Maintenance and Reengineering.

[21]  James D. Herbsleb,et al.  Social media and success in open source projects , 2012, CSCW.

[22]  Daniel M. German,et al.  Open source software peer review practices , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[23]  Premkumar T. Devanbu,et al.  Open Borders? Immigration in Open Source Projects , 2007, Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007).

[24]  Weiss Dawid,et al.  Quantitative analysis of open source projects on SourceForge , 2005 .

[25]  Premkumar T. Devanbu,et al.  Sample size vs. bias in defect prediction , 2013, ESEC/FSE 2013.

[26]  Eirini Kalliamvakou The Code-Centric Collaboration Perspective : Evidence from GitHub , 2014 .

[27]  Anita Sarma,et al.  A network of Rails a graph dataset of Ruby on Rails and associated projects , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[28]  Christian Bird,et al.  Convergent contemporary software peer review practices , 2013, ESEC/FSE 2013.

[29]  Gina Venolia,et al.  The secret life of bugs: Going past the errors and omissions in software repositories , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[30]  Harald C. Gall,et al.  Populating a Release History Database from version control and bug tracking systems , 2003, International Conference on Software Maintenance, 2003. ICSM 2003. Proceedings..

[31]  Alberto Bacchelli,et al.  Expectations, outcomes, and challenges of modern code review , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[32]  Sean P. Goggins,et al.  Performance and participation in open source software on GitHub , 2013, CHI Extended Abstracts.

[33]  Georgios Gousios,et al.  A Dataset for Pull Request Research , 2014 .

[34]  J. Herbsleb,et al.  Two case studies of open source software development: Apache and Mozilla , 2002, TSEM.

[35]  Austen Rainer,et al.  Evaluating the Quality and Quantity of Data on Open Source Software Projects , 2005 .

[36]  Ahmed E. Hassan,et al.  A Case Study of Bias in Bug-Fix Datasets , 2010, 2010 17th Working Conference on Reverse Engineering.

[37]  Gerardo Canfora,et al.  Identifying Changed Source Code Lines from Version Repositories , 2007, Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007).

[38]  Harald C. Gall,et al.  Fine-grained analysis of change couplings , 2005, Fifth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'05).

[39]  Robert Love,et al.  Linux Kernel Development , 2003 .

[40]  Leif Singer,et al.  Creating a shared understanding of testing culture on a social coding site , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[41]  Harald C. Gall,et al.  Visualizing multiple evolution metrics , 2005, SoftVis '05.