GitHub Repositories with Links to Academic Papers: Open Access, Traceability, and Evolution

Traceability between published scientific breakthroughs and their implementation is essential, especially when Open Source Software implements bleeding-edge science in its code. However, establishing the link between GitHub repositories and academic papers can prove difficult, and the impact of such links remains unknown. This paper investigates the role of academic paper references contained in these repositories. We conducted a large-scale study of 20 thousand GitHub repositories to establish the prevalence of references to academic papers. We use a mixed-methods approach to identify Open Access (OA), traceability, and evolutionary aspects of the links. Although referencing a paper is not typical, we find that a vast majority of referenced academic papers are OA. In terms of traceability, our analysis revealed that machine learning is the most prevalent topic of repositories that reference papers, and that these repositories tend to be affiliated with academic communities. More than half of the papers do not link back to any repository. A case study of referenced arXiv papers shows that most of these papers are high-impact and influential, align with academia, and are referenced by repositories written in different programming languages. From the evolutionary aspect, we find that the referenced papers and the links to them rarely change.
