ALFAA: Active Learning Fingerprint Based Anti-Aliasing for Correcting Developer Identity Errors in Version Control Data

Accurate determination of developer identities is important for software engineering research and practice. Without it, even simple questions such as “how many developers does a project have?” cannot be answered. The commonly used version control data from Git is full of identity errors, and the existing approaches to correcting these errors are difficult to validate at large scale and cannot be easily improved. We therefore aim to develop a scalable, highly accurate, easy-to-use, and easy-to-improve approach to correcting software developer identity errors. We first amalgamate developer identities from version control systems in open source software repositories and investigate the nature and prevalence of these errors, design corrective algorithms, and estimate the impact of the errors on networks inferred from this data. We investigate these questions using a collection of over 1B Git commits with over 23M recorded author identities. By inspecting the author strings that occur most frequently, we group identity errors into categories. We then augment the author strings with three behavioral fingerprints: time-zone frequencies, the set of files modified, and a vector embedding of the commit messages. We create a manually validated set of identities for a subset of OpenStack developers using an active learning approach and use it to fit supervised learning models that predict the identities for the remaining author strings in OpenStack. We then compare these predictions with a competing commercially available effort and a leading research method. Finally, we compare network measures for file-induced author networks based on corrected and raw data. We find that commits made from different environments, misspellings, organizational IDs, default values, and anonymous IDs are the major sources of errors. We also find that supervised learning methods reduce errors severalfold in comparison to existing research and commercial methods, and that the active learning approach is an effective way to create validated datasets. Results also indicate that correcting developer identities has a large impact on the inferred social network. We believe that our proposed Active Learning Fingerprint Based Anti-Aliasing (ALFAA) approach will expedite research progress in the software engineering domain for applications that involve developer identities.
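
To make the fingerprint idea concrete, the following is a minimal sketch of how pairwise features for two author strings might be computed before being passed to a supervised classifier. It is an illustration under stated assumptions, not the implementation evaluated in this work: the record layout (`author`, `timezones`, `files`), the function names, and the choice of cosine and Jaccard similarity are all hypothetical, and the commit-message embedding fingerprint is omitted to keep the example within the Python standard library.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def string_similarity(author_a, author_b):
    """Character-level similarity of two author strings (case-insensitive)."""
    return SequenceMatcher(None, author_a.lower(), author_b.lower()).ratio()


def timezone_similarity(tz_a, tz_b):
    """Cosine similarity between the time-zone frequency counts of two authors."""
    ca, cb = Counter(tz_a), Counter(tz_b)
    dot = sum(ca[k] * cb[k] for k in ca.keys() & cb.keys())
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def file_overlap(files_a, files_b):
    """Jaccard overlap between the sets of files each author modified."""
    a, b = set(files_a), set(files_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(rec_a, rec_b):
    """Feature vector for one candidate pair of author identities.

    Each record is a hypothetical dict with keys 'author' (the raw author
    string), 'timezones' (one UTC offset per commit), and 'files' (paths).
    """
    return [
        string_similarity(rec_a["author"], rec_b["author"]),
        timezone_similarity(rec_a["timezones"], rec_b["timezones"]),
        file_overlap(rec_a["files"], rec_b["files"]),
    ]


if __name__ == "__main__":
    a = {"author": "Jane Doe <jane@example.org>",
         "timezones": ["+0100", "+0100", "+0200"],
         "files": ["nova/api.py", "nova/db.py"]}
    b = {"author": "jane doe <jdoe@corp.example.com>",
         "timezones": ["+0100", "+0200"],
         "files": ["nova/db.py", "nova/utils.py"]}
    print(pair_features(a, b))
```

In a setup like this, each feature vector would be paired with a manually validated match or non-match label, and the pairs on which the classifier is least certain would be the natural candidates for the next round of manual review in the active learning loop.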
