The promises and perils of mining git

We are now witnessing the rapid growth of decentralized source code management (DSCM) systems, in which every developer has her own repository. DSCMs facilitate a style of collaboration in which work output can flow sideways (and privately) between collaborators, rather than always up and down (and publicly) via a central repository. Decentralization comes with both the promise of new data and the peril of its misinterpretation. We focus on git, a very popular DSCM used in high-profile projects. Decentralization, and other features of git, such as automatically recorded contributor attribution, lead to richer content histories, giving rise to new questions such as “How do contributions flow between developers to the official project repository?” However, there are pitfalls. Commits may be reordered, deleted, or edited as they move between repositories. The semantics of terms common to SCMs and DSCMs sometimes differ markedly, potentially creating confusion. For example, a commit is immediately visible to all developers in centralized SCMs, but not in DSCMs. Our goal is to help researchers interested in DSCMs avoid these and other perils when mining and analyzing git data.

[1]  Bart Massey Regurgitate : Using GIT For F / LOSS Data Collection , 2006 .

[2]  Harald C. Gall,et al.  Visualizing multiple evolution metrics , 2005, SoftVis '05.

[3]  Gerardo Canfora,et al.  Identifying Changed Source Code Lines from Version Repositories , 2007, Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007).

[4]  Harald C. Gall,et al.  Populating a Release History Database from version control and bug tracking systems , 2003, International Conference on Software Maintenance, 2003. ICSM 2003. Proceedings..

[5]  Andreas Zeller,et al.  Predicting faults from cached history , 2008, ISEC '08.

[6]  Michael D. Ernst,et al.  Prioritizing Warning Categories by Analyzing Software History , 2007, Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007).

[7]  Andreas Zeller,et al.  Mining version histories to guide software changes , 2005, Proceedings. 26th International Conference on Software Engineering.

[8]  Thomas Zimmermann,et al.  Preprocessing CVS Data for Fine-Grained Analysis , 2004, MSR.

[9]  Thomas Zimmermann,et al.  Automatic Identification of Bug-Introducing Changes , 2006, 21st IEEE/ACM International Conference on Automated Software Engineering (ASE'06).

[10]  Premkumar T. Devanbu,et al.  Detecting Patch Submission and Acceptance in OSS Projects , 2007, Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007).

[11]  Harald C. Gall,et al.  Fine-grained analysis of change couplings , 2005, Fifth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'05).

[12]  Harald C. Gall,et al.  Software evolution: analysis and visualization , 2006, ICSE '06.

[13]  Daniel M. German,et al.  Open source software peer review practices , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[14]  Premkumar T. Devanbu,et al.  Latent social structure in open source projects , 2008, SIGSOFT '08/FSE-16.

[15]  J. Herbsleb,et al.  Two case studies of open source software development: Apache and Mozilla , 2002, TSEM.

[16]  Robert Love,et al.  Linux Kernel Development (2nd Edition) (Novell Press) , 2005 .

[17]  Premkumar T. Devanbu,et al.  Open Borders? Immigration in Open Source Projects , 2007, Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007).

[18]  Janice Singer,et al.  Hipikat: a project memory for software development , 2005, IEEE Transactions on Software Engineering.