Software evolution: the lifetime of fine-grained elements

A model of the lifetime of individual source code lines or tokens can estimate maintenance effort, guide preventive maintenance, and, more broadly, identify factors that can improve the efficiency of software development. We present methods and tools that track the birth and death of each line or token. Using them, we analyze 3.3 billion source code element lifetime events in 89 revision control repositories. Statistical analysis shows that code lines are durable, with a median lifespan of about 2.4 years, and that young lines are more likely to be modified or deleted than older ones: lifetimes follow a Weibull distribution whose hazard rate decreases over time. This behavior appears to be independent of the specific characteristics of lines or tokens, as we could not identify factors that significantly influence their longevity across projects. Programming language, developer tenure, and developer experience were not found to be significantly correlated with line or token longevity, while project size and project age showed only a slight correlation.
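To illustrate what a Weibull lifetime model with a decreasing hazard rate implies, the following minimal sketch computes the hazard at several line ages. The shape parameter of 0.5 is a hypothetical value chosen only because any shape below 1 yields a decreasing hazard; it is not a fitted value from the study. The scale is derived from the reported median of about 2.4 years, and the function names are illustrative.

```python
import math


def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survival(t, shape, scale):
    """Probability that an element is still alive at age t."""
    return math.exp(-((t / scale) ** shape))


def scale_from_median(median, shape):
    """Invert median = lam * (ln 2) ** (1 / k) to obtain the scale lam."""
    return median / (math.log(2) ** (1 / shape))


ASSUMED_SHAPE = 0.5   # hypothetical; any value below 1 gives a decreasing hazard
MEDIAN_YEARS = 2.4    # median line lifespan reported in the abstract

scale = scale_from_median(MEDIAN_YEARS, ASSUMED_SHAPE)
ages = [0.5, 1, 2, 4, 8]  # line ages in years
hazards = [weibull_hazard(t, ASSUMED_SHAPE, scale) for t in ages]

for age, hazard in zip(ages, hazards):
    alive = weibull_survival(age, ASSUMED_SHAPE, scale)
    print(f"age {age:>4} y: hazard {hazard:.3f} per year, survival {alive:.3f}")
```

Under these assumptions the hazard falls as a line ages, which matches the observation that young lines are the most likely to be modified or deleted, while half of all lines are still alive at 2.4 years.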
