Representation of Developer Expertise in Open Source Software

With tens of millions of projects and developers, the OSS ecosystem is both vibrant and intimidating. On the one hand, it hosts the source code for the most critical infrastructure and has the most brilliant developers as contributors; on the other hand, poor-quality or even malicious software and novice developers abound. External contributions are critical to OSS projects, but the chance that a contribution is accepted, or even considered, depends on the trust between maintainers and contributors. Such trust is built over repeated interactions, and coding platforms facilitate it by providing signals of project or developer quality through measures of activity (commits) and social relationships (followers/stars). These signals, however, do not represent the specific expertise of a developer. We therefore aim to address this gap by defining a skill space for APIs, developers, and projects that reflects what developers know (and what projects need) more precisely than aggregate activity counts can, and more generally than pointing to the individual files developers have changed in the past. Specifically, we use the World of Code infrastructure to extract the complete set of APIs in the files changed by all open source developers. We use that data to represent APIs, developers, and projects in the skill space, and we evaluate whether alignment measures in the skill space can predict whether developers use new APIs, join new projects, or get their pull requests accepted. We also check whether the developers' representation in the skill space aligns with their self-reported expertise. Our results suggest that the proposed embedding in the skill space achieves our aims. It may serve as a signal to increase trust in (and the efficiency of) open source ecosystems, and it may also allow more detailed investigations of other phenomena related to developer proficiency and learning.
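To make the idea of alignment in a skill space concrete, the following is a minimal sketch rather than the method used in this work. It assumes each changed file has already been reduced to the list of APIs it uses, and it swaps in a sparse count vector over API names for the learned embedding; the alignment measure is assumed to be cosine similarity. The names `skill_vector` and `alignment`, and the toy data, are hypothetical.

```python
import math
from collections import Counter


def skill_vector(files):
    """Build a sparse API-count vector from a list of files.

    Each file is given as the list of API names it uses (an assumed
    input format; extracting those names is out of scope here).
    """
    counts = Counter()
    for apis in files:
        counts.update(apis)
    return counts


def alignment(a, b):
    """Cosine similarity between two sparse API-count vectors."""
    dot = sum(weight * b.get(api, 0) for api, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector aligns with nothing
    return dot / (norm_a * norm_b)


# Toy data (hypothetical): APIs used in files changed by one developer
# and in files belonging to two projects.
developer = skill_vector([["numpy", "pandas"], ["numpy"]])
project_a = skill_vector([["numpy", "pandas", "scipy"], ["pandas"]])
project_b = skill_vector([["react", "express"]])

print("developer vs project_a:", alignment(developer, project_a))
print("developer vs project_b:", alignment(developer, project_b))
```

In this toy setting the developer aligns more closely with `project_a`, which shares APIs with the files they changed, than with `project_b`, which shares none. A learned embedding would additionally place related but distinct APIs near one another, which plain counts cannot do.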
