Who is Who in the Mailing List? Comparing Six Disambiguation Heuristics to Identify Multiple Addresses of a Participant

Many software projects adopt mailing lists for communication among developers and users. Researchers have been mining the history of such lists to study the behavior, organization, and evolution of these communities. A potential threat to the validity of such studies is that participants often use multiple email addresses within a single mailing list. This can distort results and tools, for example when extracting social networks. The issue is particularly relevant for popular, long-running Open Source Software (OSS) projects, which attract the participation of thousands of people. Researchers have proposed heuristics to identify multiple email addresses belonging to the same participant; however, few studies have analyzed how effective these heuristics are. In addition, many studies still do not use any heuristic for author disambiguation, which can compromise their results. In this paper, we compare six heuristics from the literature using data from 150 mailing lists of Apache Software Foundation projects. We found that the heuristic proposed by Oliva et al. and a Naïve heuristic outperformed the others in most cases when considering the F-measure. We also found that the time window and the size of the dataset influence the effectiveness of each heuristic. These results may help researchers and tool developers choose the most appropriate heuristic, and they highlight the need to deal with identity disambiguation, especially in open source software communities with a large number of participants.
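To make the evaluation metric concrete, the following is a minimal sketch of how an F-measure could be computed for an identity-merge heuristic. The abstract does not state how the F-measure was calculated, so the pairwise formulation used here (comparing which address pairs are grouped together) is an assumption, and the function names and example addresses are hypothetical.

```python
from itertools import combinations


def same_person_pairs(clusters):
    """Return the set of unordered address pairs that a clustering groups together."""
    pairs = set()
    for cluster in clusters:
        # Sorting makes each pair canonical, so (a, b) and (b, a) compare equal.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f_measure(predicted, truth):
    """Compute pairwise precision, recall, and F-measure.

    ASSUMPTION: a pair of addresses is a true positive when both the
    heuristic and the reference set assign it to the same participant.
    """
    pred = same_person_pairs(predicted)
    gold = same_person_pairs(truth)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical reference set: one participant with three addresses, one with a single address.
truth = [{"a@x.org", "a@y.org", "a@z.org"}, {"b@x.org"}]
# Hypothetical heuristic output: one correct merge and one wrong merge.
predicted = [{"a@x.org", "a@y.org"}, {"a@z.org", "b@x.org"}]

precision, recall, f_measure = pairwise_f_measure(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
```

In this toy case the heuristic proposes two merged pairs, one of which is correct, and misses two of the three true pairs, so a wrong merge lowers precision while a missed merge lowers recall.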
