Trends That Affect Temporal Analysis Using SourceForge Data

SourceForge is a valuable source of software artifact data for researchers who study project evolution and developer behavior. However, the data exhibit patterns that may bias temporal analyses. Most notable are cliff walls in project source code repository timelines: abrupt jumps produced by large commits that are out of character for the given project. These cliff walls often hide significant periods of development and developer collaboration, which is a threat to studies that rely on SourceForge repository data. We demonstrate how to identify these cliff walls, discuss reasons for their appearance, and propose preliminary measures for mitigating their effects in evolution-oriented studies.
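As a rough illustration of what "large commits that are out of character" could mean in practice, the sketch below flags commits whose size far exceeds a project's typical commit size. It is not the detection method used in the paper; the function name `find_cliff_walls`, the use of the median as the "typical" size, and the default `factor` of 10 are all assumptions chosen for illustration.

```python
from statistics import median


def find_cliff_walls(commit_sizes, factor=10.0):
    """Return indices of commits that look out of character for a project.

    commit_sizes: lines changed per commit, in chronological order.
    factor: hypothetical multiplier; a commit is flagged when its size
            exceeds factor * (median commit size). Not taken from the paper.
    """
    if not commit_sizes:
        return []
    # Use the median so that the outliers themselves do not inflate the baseline.
    typical = median(commit_sizes)
    # Guard against a zero median (e.g. many empty commits).
    threshold = factor * max(typical, 1)
    return [i for i, size in enumerate(commit_sizes) if size > threshold]


# Made-up commit sizes for a single project; the fifth commit is a bulk import.
sizes = [12, 30, 8, 25, 4000, 15, 20]
print(find_cliff_walls(sizes))  # prints [4]
```

A median-based baseline is only one possible choice; a study would likely need to tune the threshold per project and inspect flagged commits manually, since a large commit may also reflect legitimate activity such as a release merge.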
