Move-optimized source code tree differencing

When changes between two source code files must be expressed as a list of edit actions (an edit script), modern tree differencing algorithms are superior to most text-based approaches because they take code movements into account and express source code changes more accurately. We present five general optimizations that can be added to state-of-the-art tree differencing algorithms to shorten the resulting edit scripts. Applied to Gumtree, RTED, JSync, and ChangeDistiller, they lead to shorter scripts for 18-98% of the changes in the histories of 9 open-source software repositories. These optimizations are also part of our novel Move-optimized Tree DIFFerencing algorithm (MTDIFF), which detects moved code parts with higher accuracy. MTDIFF, which is based on the ideas of ChangeDistiller, further shortens the edit script for another 20% of the changes in the repositories. MTDIFF and all the benchmarks are available under an open-source license.
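To illustrate why detecting moves tends to shorten an edit script, the following minimal sketch compares two ways of relocating a small subtree. It is not MTDIFF or any of the algorithms named above. The `Node` class, the helper functions, the labels, and the cost model of one action per node are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical AST node: a label plus an ordered tuple of children."""
    label: str
    children: Tuple["Node", ...] = ()


def nodes(tree: Node) -> Iterator[Node]:
    """Yield every node of the subtree in pre-order."""
    yield tree
    for child in tree.children:
        yield from nodes(child)


def relocate_without_move(subtree: Node) -> List[tuple]:
    """Relocate a subtree using only delete and insert actions.

    Assumed cost model: one delete and one insert per node.
    """
    deletes = [("delete", n.label) for n in nodes(subtree)]
    inserts = [("insert", n.label) for n in nodes(subtree)]
    return deletes + inserts


def relocate_with_move(subtree: Node, new_parent: str) -> List[tuple]:
    """Relocate the same subtree with a single move action on its root."""
    return [("move", subtree.label, new_parent)]


# A call statement with two arguments that moves into another method.
call = Node("call:log", (Node("arg:msg"), Node("arg:level")))

without_move = relocate_without_move(call)
with_move = relocate_with_move(call, "method:run")

print(len(without_move), len(with_move))  # prints "6 1"
```

Under this assumed cost model, the script without moves grows with the size of the relocated subtree, while the move-aware script stays at one action. This is the sense in which recognizing more moves yields shorter edit scripts.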
