Studying software evolution using artefacts' shared information content

To study software evolution, it is necessary to measure artefacts that are representative of project releases. If we regard the process of software evolution as copying followed by modification, then, by analogy, emphasising what remains the same between releases leads to a focus on similarity between artefacts. At the same time, software artefacts, stored digitally as binary strings, are all information. This paper introduces a new method for measuring software evolution in terms of artefacts' shared information content. A similarity value representing the quantity of information shared by a pair of artefacts is produced using a calculation based on Kolmogorov complexity. Similarity values for releases are then collated over the software's evolution to form a map that quantifies change as lack of similarity. The method has general applicability: it can disregard otherwise salient software features such as programming paradigm, language or application domain, because it considers software artefacts purely in terms of the mathematically justified concept of information content. Three open-source projects are analysed to show the method's utility. Preliminary experiments on udev and git verify the measurement of the projects' evolutions. An experiment on ArgoUML validates the measured evolution against experimental data from other studies.
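As a rough illustration of the kind of calculation described above, the following sketch approximates shared information content with the normalized compression distance (NCD), a commonly used computable stand-in for Kolmogorov-complexity-based distances. It is a sketch under stated assumptions, not the paper's method: the abstract does not specify the exact formula or compressor, so the use of NCD, the choice of zlib, and the names `compressed_size`, `ncd` and `similarity_map` are all hypothetical. Values near 0 suggest that two artefacts share most of their information; values near 1 suggest they share little.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, used as a crude proxy for
    Kolmogorov complexity (an assumption; any real compressor only
    gives an upper-bound approximation)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def similarity_map(releases: list[bytes]) -> list[list[float]]:
    """Pairwise NCD values for a sequence of release artefacts.

    Entry [i][j] is the distance between release i and release j, so
    larger values indicate more change between those two releases.
    """
    n = len(releases)
    return [[ncd(releases[i], releases[j]) for j in range(n)] for i in range(n)]
```

In use, each element of `releases` would be the raw bytes of one release artefact (for example, a source archive read from disk), and the resulting matrix could be plotted as a map of change over the project's history. Note that zlib's 32 KiB window limits how far back it can find repeated content, so a compressor with a larger window would likely be needed for artefacts of realistic size.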
