More Effective Software Repository Mining

Background: Data mining and analyzing of public Git software repositories is a growing research field. The tools used for studies that investigate a single project or a group of projects have been refined, but it is not clear whether the results obtained on such ``convenience samples'' generalize. Aims: This paper aims to elucidate the difficulties faced by researchers who would like to ascertain the generalizability of their findings by introducing an interface that addresses the issues with obtaining representative samples. Results: To do that we explore how to exploit the World of Code system to make software repository sampling and analysis much more accessible. Specifically, we present a resource for Mining Software Repository researchers that is intended to simplify data sampling and retrieval workflow and, through that, increase the validity and completeness of data. Conclusions: This system has the potential to provide researchers a resource that greatly eases the difficulty of data retrieval and addresses many of the currently standing issues with data sampling.

[1]  Georgios Gousios,et al.  The GHTorent dataset and tool suite , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[2]  Fabian Trautsch,et al.  Adressing Problems with External Validity of Repository Mining Studies Through a Smart Data Platform , 2016, 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR).

[3]  Premkumar T. Devanbu,et al.  Wait for It: Determinants of Pull Request Evaluation Latency on GitHub , 2015, 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories.

[4]  Masahide Nakamura,et al.  Integrating Service Oriented MSR Framework and Google Chart Tools for Visualizing Software Evolution , 2012, 2012 Fourth International Workshop on Empirical Software Engineering in Practice.

[5]  Nicholas Matragkas,et al.  Crossflow: A Framework for Distributed Mining of Software Repositories , 2019, 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR).

[6]  Audris Mockus,et al.  World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data , 2019, 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR).

[7]  Hajimu Iida,et al.  Mining the Modern Code Review Repositories: A Dataset of People, Process and Product , 2016, 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR).

[8]  Marco Tulio Valente,et al.  RefDiff: Detecting Refactorings in Version Histories , 2017, 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR).