Linguistic Change in Open Source Software

In this paper, we seek to advance the state of the art in code evolution analysis, in both research and practice, by statistically analyzing, interpreting, and formally describing the evolution of the code lexicon in Open Source Software (OSS). Our underlying hypothesis is that, like natural language, the code lexicon is subject to evolutionary principles. Adapting theories and statistical models of natural language evolution to code is therefore expected to provide unique insights into software evolution. Our analysis is conducted using 2,000 OSS systems sampled from a broad range of application domains. Our results show that a) OSS projects exhibit a significant shift in their linguistic identity over time, b) different syntactic structures of the code lexicon evolve differently, and c) different factors of OSS development and different maintenance activities impact the code lexicon differently. These insights lay a preliminary foundation for modeling the linguistic history of OSS projects. In the long run, this foundation will be used to support basic software maintenance and program comprehension activities, and to gain new theoretical insights into the complex interplay between linguistic change and various system and human aspects of OSS development.
