Parameterized Duplication in Strings: Algorithms and an Application to Software Maintenance

As an aid in software maintenance, it would be useful to be able to track down duplication in large software systems efficiently. Duplication in code is often in the form of sections of code that are the same except for a systematic change of parameters such as identifiers and constants. To model such parameterized duplication in code, this paper introduces the notions of parameterized strings and parameterized matches of parameterized strings. A data structure called a parameterized suffix tree is defined to aid in searching for parameterized matches. For fixed alphabets, algorithms are given to construct a parameterized suffix tree in linear time and to find all maximal parameterized matches over a threshold length in a parameterized string in time linear in the size of the input plus the number of matches reported. The algorithms have been implemented, and experimental results show that they perform well on C code.
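To make the notion of a parameterized match concrete, the following minimal sketch tests whether two token sequences are identical up to a consistent, one-to-one renaming of parameters. It assumes an encoding in which each parameter occurrence is replaced by the distance back to its previous occurrence (0 for a first occurrence); the abstract does not spell out this encoding, so treat it as an illustrative assumption. The function names `prev_encode` and `p_match`, and the rule that any Python-style identifier counts as a parameter, are hypothetical choices made for this example. The sketch runs a naive pairwise check and does not build a parameterized suffix tree.

```python
def prev_encode(tokens, is_param):
    """Encode parameters by distance to their previous occurrence.

    Non-parameter tokens are kept verbatim. Tagged tuples keep encoded
    distances from colliding with literal tokens.
    """
    last_seen = {}  # parameter token -> index of its most recent occurrence
    encoded = []
    for i, tok in enumerate(tokens):
        if is_param(tok):
            distance = i - last_seen[tok] if tok in last_seen else 0
            encoded.append(("param", distance))
            last_seen[tok] = i
        else:
            encoded.append(("const", tok))
    return encoded


def p_match(a, b, is_param):
    """True if a and b agree up to a one-to-one renaming of parameters."""
    return len(a) == len(b) and prev_encode(a, is_param) == prev_encode(b, is_param)


# Assumption for illustration only: identifiers are the parameters.
def is_identifier(tok):
    return tok.isidentifier()


s1 = "x = y + x ;".split()
s2 = "a = b + a ;".split()
s3 = "x = y + y ;".split()

print(p_match(s1, s2, is_identifier))  # True: rename x->a, y->b
print(p_match(s3, s2, is_identifier))  # False: no consistent renaming exists
```

Comparing encodings rather than searching for a renaming keeps the check linear in the length of the two sequences for this pairwise case.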
