Public Git Archive: A Big Code Dataset for All

The number of open source software projects has been growing exponentially. The major online software repository host, GitHub, has accumulated tens of millions of publicly available Git version-controlled repositories. Although the research potential enabled by the available open source code is clearly substantial, no significant large-scale open source code datasets exist. In this paper, we present the Public Git Archive, a dataset of 182,014 top-bookmarked Git repositories from GitHub. We describe the novel data retrieval pipeline that can be used to reproduce it. We also elaborate on the strategy for performing dataset updates and on the associated legal issues. The Public Git Archive occupies 3.0 TB on disk and is an order of magnitude larger than the current source code datasets. The dataset is made available through HTTP and provides the source code of the projects, the related metadata, and the development history. The data retrieval pipeline employs an optimized worker queue model and an optimized archive format that stores forked Git repositories efficiently, reducing the amount of data to download and persist. The Public Git Archive aims to open a myriad of new opportunities for "Big Code" research.
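
The abstract does not describe how forked repositories are stored together. As a rough illustration of the general idea only, the sketch below groups repositories that share the same initial (root) commit, so that their common history would need to be stored once per group. The grouping key, the function name `group_by_root_commit`, and the sample URLs and hashes are all assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_root_commit(repos: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group repository URLs by the hash of their root commit.

    Assumption for this sketch: forks share the root commit of the
    repository they were forked from, so each group approximates a
    set of repositories with a common history.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for url, root_hash in repos:
        groups[root_hash].append(url)
    return dict(groups)


# Hypothetical input: (repository URL, root commit hash) pairs.
repos = [
    ("github.com/alice/project", "a1b2c3"),
    ("github.com/bob/project", "a1b2c3"),   # a fork of alice/project
    ("github.com/carol/other", "d4e5f6"),
]

groups = group_by_root_commit(repos)
for root_hash, urls in groups.items():
    print(root_hash, urls)
```

In a layout like this, each group could be written to a single archive, so objects shared between a repository and its forks would not be duplicated on disk.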
