A Methodology for Measuring FLOSS Ecosystems

The FLOSS ecosystem as a whole is a critical component of the world’s computing infrastructure, yet it is not well understood. To understand it, we first need to measure it. We therefore aim to provide a framework for measuring key aspects of the entire FLOSS ecosystem. We first consider the FLOSS ecosystem through the lens of a supply chain. A supply chain is a series of interconnected parties, each contributing unique elements and expertise, so that a final solution is accessible to all interested parties. This perspective has been extremely successful in helping companies cope with the multifaceted risks caused by distributed decision-making in their supply chains, especially as those chains have become more global. Software ecosystems similarly represent distributed decisions in supply chains of code and author contributions, suggesting that relationships among projects, developers, and source code have to be measured. We then describe a massive measurement infrastructure involving discovery, extraction, cleaning, correction, and augmentation of publicly available open-source data from version control systems and other sources. Finally, we illustrate how the key relationships among the nodes representing developers, projects, changes, and files can be accurately measured, how to handle the absence of user-base measures in version control data, and how such a measurement infrastructure can be used to increase knowledge resilience in FLOSS.
