Inferring Repository File Structure Modifications Using Nearest-Neighbor Clone Detection

When re-engineering a legacy software system, a good knowledge of the system's past modifications is important for recovering its design and migrating its functionality. In the absence of a reliable revision history, development teams often rely on system experts to uncover this hidden history and recover the software design. In this paper, we propose a new technique that infers the history of repository file modifications of a software system using only its past released versions. The technique relies on nearest-neighbor clone detection using the Manhattan distance. We evaluated it empirically on Tomcat, JHotDraw and Adempiere, using their SVN information as the oracle of file operations, and obtained an average precision of 97% and an average recall of 98%. The evaluation also highlighted the phenomenon of implicit Moves, that is, Moves between versions of a system that are not recorded in the SVN repository. When neither a revision history nor software experts are available, development teams can use the proposed technique while re-engineering their legacy systems.
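To make the idea concrete, the following is a minimal sketch of how file operations between two released versions might be inferred with a nearest-neighbor search under the Manhattan distance. It is not the implementation evaluated in the paper: the token-frequency feature vectors, the normalized threshold of 0.3, the brute-force search, and the names `token_vector`, `manhattan` and `infer_operations` are all assumptions made for illustration.

```python
import re
from collections import Counter


def token_vector(source: str) -> Counter:
    """Token-frequency vector of a file (assumed feature representation)."""
    return Counter(re.findall(r"\w+|[^\w\s]", source))


def manhattan(a: Counter, b: Counter) -> int:
    """Manhattan (L1) distance between two sparse frequency vectors."""
    return sum(abs(a[k] - b[k]) for k in a.keys() | b.keys())


def infer_operations(old: dict[str, str], new: dict[str, str],
                     threshold: float = 0.3):
    """Infer (operation, old_path, new_path) tuples between two versions.

    `old` and `new` map file paths to file contents. The threshold on the
    normalized distance is a hypothetical value, not one from the paper.
    """
    old_vecs = {path: token_vector(src) for path, src in old.items()}
    matched_old = set()
    ops = []

    for path, src in sorted(new.items()):
        if path in old:
            # Same path in both versions: unchanged or modified in place.
            matched_old.add(path)
            if old[path] != src:
                ops.append(("modify", path, path))
            continue

        # New path: look for its nearest neighbor among old files that
        # disappeared and have not been matched yet (brute-force search).
        vec = token_vector(src)
        candidates = [p for p in old_vecs
                      if p not in new and p not in matched_old]
        best = min(candidates,
                   key=lambda p: manhattan(old_vecs[p], vec),
                   default=None)
        if best is not None:
            dist = manhattan(old_vecs[best], vec)
            size = sum(vec.values()) + sum(old_vecs[best].values())
            if size and dist / size <= threshold:
                ops.append(("move", best, path))
                matched_old.add(best)
                continue
        ops.append(("add", None, path))

    # Old files that were neither kept nor matched as a move source.
    for path in sorted(old):
        if path not in new and path not in matched_old:
            ops.append(("delete", path, None))
    return ops


if __name__ == "__main__":
    v1 = {"src/A.java": "class A { int x; void f() { x = x + 1; } }",
          "src/B.java": "class B { String s; }"}
    v2 = {"lib/A.java": "class A { int x; void f() { x = x + 2; } }",
          "src/C.java": "interface C { void run(); }"}
    for op in infer_operations(v1, v2):
        print(op)
```

In this toy input, `src/A.java` reappears almost unchanged under `lib/`, so it is reported as a Move even though no repository log records one. A brute-force search costs one distance computation per pair of candidate files; an index over the metric space would be the natural replacement for larger systems.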
