What makes a popular academic AI repository?

Many AI researchers publish the code, data, and other resources that accompany their papers in GitHub repositories; in this paper, we refer to these repositories as academic AI repositories. Our preliminary study shows that highly cited papers are more likely to have popular academic AI repositories (and vice versa). Hence, we perform an empirical study on academic AI repositories to highlight, for AI researchers, the good software engineering practices of popular ones. We collect 1,149 academic AI repositories, label the top 20% by star count as popular and the bottom 70% as unpopular, and set aside the remaining 10% as a gap between the two groups. We propose 21 features to characterize the software engineering practices of academic AI repositories. Our experimental results show that popular and unpopular academic AI repositories differ statistically significantly in 11 of the studied features, indicating that the two groups follow significantly different software engineering practices. Furthermore, we find that the number of links to other GitHub repositories in the README file, the number of images in the README file, and the inclusion of a license are the most important features for differentiating the two groups. Our dataset and code are publicly available to the community.
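
To make the partitioning concrete, here is a minimal sketch of the labeling scheme described above: sort repositories by star count, mark the top 20% as popular and the bottom 70% as unpopular, and leave the middle 10% as a gap. The function name and NumPy-based implementation are our own illustration, not the paper's code.

```python
import numpy as np

def label_by_popularity(star_counts, popular_frac=0.20, unpopular_frac=0.70):
    """Label each repository 'popular' (top 20% by stars),
    'unpopular' (bottom 70%), or 'gap' (remaining 10%)."""
    stars = np.asarray(star_counts)
    n = len(stars)
    n_popular = int(n * popular_frac)
    n_unpopular = int(n * unpopular_frac)
    assert n_popular + n_unpopular <= n, "fractions must not overlap"

    # Indices of repositories ordered from most to least starred.
    order = np.argsort(-stars)

    labels = np.empty(n, dtype=object)
    labels[order[:n_popular]] = "popular"             # top 20%
    labels[order[n - n_unpopular:]] = "unpopular"     # bottom 70%
    labels[order[n_popular:n - n_unpopular]] = "gap"  # middle 10%
    return labels
```

On 1,149 repositories this gives 229 popular, 804 unpopular, and 116 gap repositories; the paper's exact counts may differ slightly depending on rounding and on how star-count ties are broken.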

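The abstract does not say which statistical test underlies the significance results; a common non-parametric choice for comparing a feature's distribution between two groups is the Mann-Whitney U (Wilcoxon rank-sum) test. The sketch below illustrates that general approach under an assumed dict-of-lists data layout; it is not the paper's actual analysis pipeline.

```python
from scipy.stats import mannwhitneyu

def significant_features(popular, unpopular, feature_names, alpha=0.05):
    """Return (feature, p-value) pairs whose distributions differ
    significantly between the popular and unpopular groups.

    `popular` and `unpopular` map feature name -> list of per-repository
    values (a hypothetical layout chosen for this sketch).
    """
    results = []
    for name in feature_names:
        _, p_value = mannwhitneyu(
            popular[name], unpopular[name], alternative="two-sided"
        )
        if p_value < alpha:
            results.append((name, p_value))
    return sorted(results, key=lambda pair: pair[1])
```

Run over all 21 proposed features, a screen like this is how one would identify the subset that differs significantly between the groups; with this many simultaneous tests, a multiple-comparison correction (e.g., Bonferroni or Benjamini-Hochberg) would typically be applied as well.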