The Role of Data Filtering in Open Source Software Ranking and Selection

Faced with over 100M open source projects, most empirical investigations select a subset. Most research papers in leading venues that we investigated filter projects by some measure of popularity, arguing explicitly or implicitly that unpopular projects are not of interest, may not even represent "real" software projects, or are not worthy of study. However, such filtering may have enormous effects on a study's results if, and precisely because, the response or prediction being sought is in any way related to the filtering criteria. We illustrate the impact of this practice on research outcomes by examining how filtering projects listed on GitHub affects the assessment of their popularity. We randomly sample over 100,000 repositories and use multiple regression to model the number of stars (a proxy for popularity) based on the number of commits, the duration of the project, the number of authors, and the number of core developers. Comparing a control model fitted to the entire dataset with a filtered model fitted only to projects having ten or more authors, we find that, while certain characteristics of the repository consistently predict popularity, the filtering process significantly alters the relationships between these characteristics and the response. The number of commits exhibited a positive correlation with popularity in the control sample but a negative correlation in the filtered sample. These findings highlight the potential biases introduced by data filtering and emphasize the need for careful sample selection in empirical research on mining software repositories. We recommend that empirical work either analyze complete datasets such as World of Code or employ stratified random sampling from a complete dataset to ensure that filtering does not bias the results.
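The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's actual pipeline: the data are synthetic, the log transforms and the generating process are assumptions made only for the example, and the names `fit_ols`, `synthetic_repos`, and `fit_stars_model` are hypothetical. It fits the same multiple regression once on the full sample and once on the subset with ten or more authors, so that the two sets of coefficients can be compared side by side.

```python
import math
import random

FEATURES = ("commits", "days", "authors", "core")


def fit_ols(rows, ys):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    Returns [intercept, coef_1, ..., coef_k].
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, ys))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def synthetic_repos(n, seed=0):
    """Generate hypothetical repositories (NOT real GitHub data)."""
    rng = random.Random(seed)
    repos = []
    for _ in range(n):
        authors = max(1, int(rng.lognormvariate(0.5, 1.2)))
        core = max(1, int(authors * rng.uniform(0.1, 1.0)))
        commits = max(1, int(authors * rng.lognormvariate(2.0, 1.0)))
        days = max(1, int(rng.lognormvariate(5.0, 1.0)))
        # Assumed generating process, chosen only for illustration.
        mu = 0.3 * math.log(commits) + 0.5 * math.log(authors) - 1.0
        stars = int(rng.lognormvariate(mu, 1.0))
        repos.append({"commits": commits, "days": days, "authors": authors,
                      "core": core, "stars": stars})
    return repos


def fit_stars_model(repos):
    """Regress log(stars + 1) on the log of each predictor (an assumption)."""
    rows = [[math.log(r[f]) for f in FEATURES] for r in repos]
    ys = [math.log(r["stars"] + 1) for r in repos]
    return dict(zip(("intercept",) + FEATURES, fit_ols(rows, ys)))


repos = synthetic_repos(5000)
filtered = [r for r in repos if r["authors"] >= 10]  # the filtering criterion

full_model = fit_stars_model(repos)
filtered_model = fit_stars_model(filtered)

print(f"{'term':<10} {'full':>10} {'filtered':>10}")
for name in full_model:
    print(f"{name:<10} {full_model[name]:>10.3f} {filtered_model[name]:>10.3f}")
```

With real data, the `repos` list would be replaced by the sampled repositories; whether a coefficient changes sign under filtering depends on the data, so the synthetic run should not be read as reproducing the reported result.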
