Validity of network analyses in Open Source Projects

Social network methods are frequently used to analyze networks derived from open source software (OSS) project communication and collaboration data. Such studies typically look for patterns in the information flow between contributors or contributions in these projects. Social network metrics have also been used to predict defect occurrence. However, such studies often ignore or sidestep the question of whether, and in what way, the metrics and networks under study are influenced by inadequate or missing data. In previous studies, email archives of OSS projects have provided a useful trace of the communication and coordination activities of the participants. These traces have been used to construct social networks that are then subjected to various types of analysis. However, the construction of these networks relies on assumptions that may not always hold, which leads to incomplete and sometimes incorrect networks. The question then becomes: do these errors affect the validity of the ensuing analysis? In this paper we specifically examine the stability of network metrics in the presence of inadequate and missing data. We study two issues: 1) the effect of paths with broken information flow (i.e., consecutive edges that are out of temporal order) on measures of centrality of nodes in the network, and 2) the effect of missing links on such measures. We demonstrate on three different OSS projects that, while these issues do change network topology, the metrics used in the analysis are stable with respect to such changes.
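The two issues named above can be illustrated with a minimal Python sketch. It is not the procedure used in the paper: the function names, the toy edge timestamps, and the choice of degree centrality as the measure are all assumptions made for illustration. The first function checks whether a path respects temporal order (each edge occurs no earlier than the one before it), and the remaining two recompute a simple centrality after a fraction of links is dropped at random.

```python
import random


def is_time_respecting(path, edge_time):
    """Return True if consecutive edges on the path are in temporal order.

    `edge_time` maps a directed edge (sender, receiver) to a timestamp.
    Hypothetical helper: the paper does not prescribe this representation.
    """
    times = [edge_time[(u, v)] for u, v in zip(path, path[1:])]
    return all(t1 <= t2 for t1, t2 in zip(times, times[1:]))


def degree_centrality(nodes, edges):
    """Degree of each node divided by (n - 1), treating edges as undirected."""
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    denom = max(len(degree) - 1, 1)
    return {n: d / denom for n, d in degree.items()}


def drop_edges(edges, fraction, seed=0):
    """Simulate missing links by keeping a random subset of the edges."""
    rng = random.Random(seed)
    keep = len(edges) - int(len(edges) * fraction)
    return rng.sample(edges, keep)


# Toy data (assumed): a -> b at time 1, b -> c at time 3, c -> d at time 2.
edge_time = {("a", "b"): 1, ("b", "c"): 3, ("c", "d"): 2}
nodes = ["a", "b", "c", "d"]
edges = list(edge_time)

print(is_time_respecting(["a", "b", "c"], edge_time))       # True
print(is_time_respecting(["a", "b", "c", "d"], edge_time))  # False: 3 > 2

full = degree_centrality(nodes, edges)
partial = degree_centrality(nodes, drop_edges(edges, 0.5))
print(full)
print(partial)
```

Comparing `full` with `partial`, for example by rank correlation over the nodes, is one way to ask whether a centrality measure is stable under missing links. The path `a, b, c, d` shows a broken information flow: the static graph contains the path, yet information from `a` could not have reached `d` along it.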
