A Literature Review of Clone Detection Analysis

Abstract syntax tree clone analysis

Abstract syntax tree clone analysis (Baxter et al., 1998) attempts to be more accurate than a line-based or token-based approach by building the abstract syntax tree. Before taking advantage of its AST representation, the tool first expands macros to ensure that all of the information will be in the AST. After building the AST, a hash of each of the AST subtrees is computed. This hashing abstracts away identifiers; comments and white space have already been removed in building the AST nodes. Matching relies on the hashes of the AST subtrees. The matching step first places all subtrees into buckets based on their hash values, and all of the subtrees in a bucket are then compared against a similarity threshold. The subtrees that pass are handed to a generalization process that visits the parents of the clone AST nodes until a set of parents is found that is not a code clone. Thus, the algorithm requires that clones match by exceeding the similarity threshold at each particular AST node in the hierarchy. (A sketch of the hashing and bucketing step appears after the next section.)

The algorithm's reliance on building all of the AST subtrees and performing relatively expensive AST operations makes it significantly slower than most other tools. The algorithm itself is O(|subtrees of the AST|), and it takes 120 minutes to run on 100 KLOC. In exchange, the more powerful AST manipulations allow the tool to correctly detect statement reorderings and statement insertions.

Slice based clone analysis

Even more powerful than a purely AST-based approach, one tool (Komondoor and Horwitz, 2001) uses every node in a program dependence graph to form slices to compare. A program dependence graph adds an edge between two statements whenever one depends on the other, either for a data value or for control flow; the edges are thus either data or control dependencies. The token creation step consists simply of building the program dependence graph from the source. Next, all of the program dependence graph nodes are partitioned into equivalence classes based on the syntactic similarity of their statements; differences in identifier or literal values are ignored. For each initial pair of program dependence graph nodes in an equivalence class, generalization proceeds to find the largest isomorphic subgraphs of the program dependence graph that include the two initial nodes. Backward and forward slices are added to grow the isomorphic subgraphs until no more slices can be added.

Unfortunately, the use of slicing makes the algorithm very slow: the tool takes 13 minutes to run on just 3,419 LOC! The use of slicing does make it the most accurate algorithm, however. Unlike the AST approach, similar clones do not have to include all children of some parent AST node. The approach can detect fully entangled clones where the actual cloned lines of code are spread far apart and linked only by program dependence graph dependencies. It could thus be used to handle older or much more heavily modified clones where the original copy-and-paste code has been spread apart by the insertion and refactoring of functionality. It also easily handles statement reorderings.
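To make the hashing and bucketing step of the AST approach concrete, here is a minimal sketch in Python. It is an illustration rather than Baxter et al.'s actual implementation: the Node structure, the structural_hash function, and the minimum subtree mass are all simplified assumptions. The within-bucket similarity comparison and the upward generalization through parents are omitted.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Node:
    """A simplified AST node: a kind (e.g. 'if', 'call') and children.
    Identifier and literal text is deliberately excluded, so renamed
    clones hash identically."""
    kind: str
    children: list = field(default_factory=list)

def subtrees(node):
    """Yield every subtree rooted somewhere in the AST."""
    yield node
    for child in node.children:
        yield from subtrees(child)

def structural_hash(node):
    """Hash a subtree on structure alone, abstracting away identifiers."""
    return hash((node.kind, tuple(structural_hash(c) for c in node.children)))

def mass(node):
    """Number of nodes in a subtree; tiny subtrees are not worth reporting."""
    return 1 + sum(mass(c) for c in node.children)

def find_clone_candidates(root, min_mass=3):
    """Bucket subtrees by structural hash; subtrees that share a bucket
    are candidate clones to be checked against a similarity threshold."""
    buckets = defaultdict(list)
    for sub in subtrees(root):
        if mass(sub) >= min_mass:
            buckets[structural_hash(sub)].append(sub)
    return [group for group in buckets.values() if len(group) > 1]
```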
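The first phase of the slicing approach, partitioning program dependence graph nodes into equivalence classes while ignoring identifier and literal values, can be sketched similarly. This is a toy rendering under strong assumptions: PDG nodes are reduced to statement strings, and the abstraction is a pair of regular expressions rather than real parsing; the slice-growing phase is not shown.

```python
import re
from collections import defaultdict

def abstract_statement(stmt):
    """Abstract a statement's text so that statements differing only in
    identifier or literal values fall into the same equivalence class."""
    stmt = re.sub(r'"[^"]*"|\d+', 'LIT', stmt)               # literals -> LIT
    stmt = re.sub(r'\b(?!LIT\b)[A-Za-z_]\w*\b', 'ID', stmt)  # identifiers -> ID
    return stmt

def partition_pdg_nodes(pdg_nodes):
    """Group PDG nodes into equivalence classes; each pair within a class
    seeds the search for isomorphic subgraphs, which the real algorithm
    grows by adding backward and forward slices."""
    classes = defaultdict(list)
    for node in pdg_nodes:
        classes[abstract_statement(node)].append(node)
    return [c for c in classes.values() if len(c) > 1]

# Example: both assignments abstract to "ID = ID + LIT" and pair up.
print(partition_pdg_nodes(["x = y + 1", "total = count + 42", "return x"]))
```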
Call graph node origin analysis

An approach very similar to clone detection analysis has been used to track the movement of code over time in a process called origin analysis (Godfrey and Zou, 2005). Origin analysis attempts to ascertain, for every function, the function that it came from in the previous version. This is interesting when functions are split, merged, or renamed, as a simple name search through the previous version will not establish a linkage with the current version. Origin analysis operates on the call graph, where every call graph node is a token. It attempts to match a function in one version with the most similar function in a previous version.

Rather than having a single transformation for each token, origin analysis provides a variety of "matchers" that both transform the call graph nodes and rate the similarity between clone candidates. Overall similarity can be any combination of the individual matchers. The name matcher finds the longest common substring of two function names. The metrics matcher calculates a weighted sum of LOC, fan-in/fan-out, number of variables, and cyclomatic complexity. The declaration matcher finds the longest common substring of the lexically sorted parameter identifiers. Finally, the call relation matcher computes the size of the intersection between the candidates' caller and callee sets. The approach is most useful for examining splitting, merging, and renaming.

Rather than running all of the matchers over all of the code in batch mode, the origin analysis tool provides an incremental, as-needed analysis. The user must select a set of candidate functions in each version, and the tool will then attempt to identify the origins of those particular functions. This analysis is then "almost instantaneous" rather than taking a long time. Moreover, the user may wish to switch matchers, matcher parameters, or matcher weights. Being interactive allows the developer to quickly try all of these options without having to wait for a long batch computation to complete.
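Because overall similarity can be any combination of the individual matchers, the scheme lends itself to a short sketch. The following Python is a hypothetical rendering, not Godfrey and Zou's tool: the Function record, the normalizations, and the weights are assumptions made for illustration, and the metrics matcher is reduced to LOC alone.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Function:
    """A call graph node: one function in one version of the system."""
    name: str
    params: list
    loc: int
    callers: set
    callees: set

def _lcs_ratio(x, y):
    """Longest common substring of two strings, normalized to [0, 1]."""
    m = SequenceMatcher(None, x, y).find_longest_match(0, len(x), 0, len(y))
    return m.size / max(len(x), len(y), 1)

def name_matcher(a, b):
    return _lcs_ratio(a.name, b.name)

def declaration_matcher(a, b):
    """Compare lexically sorted parameter identifiers."""
    return _lcs_ratio(",".join(sorted(a.params)), ",".join(sorted(b.params)))

def metrics_matcher(a, b):
    """Simplified: closeness of LOC only; the real matcher also weighs
    fan-in/out, variable counts, and cyclomatic complexity."""
    return 1 - abs(a.loc - b.loc) / max(a.loc, b.loc, 1)

def call_relation_matcher(a, b):
    """Overlap between the candidates' caller and callee sets."""
    overlap = len(a.callers & b.callers) + len(a.callees & b.callees)
    total = len(a.callers | b.callers) + len(a.callees | b.callees)
    return overlap / total if total else 0.0

def similarity(a, b, weights=(0.3, 0.2, 0.2, 0.3)):
    """Overall similarity as a weighted combination of the matchers."""
    scores = (name_matcher(a, b), declaration_matcher(a, b),
              metrics_matcher(a, b), call_relation_matcher(a, b))
    return sum(w * s for w, s in zip(weights, scores))

def find_origin(new_fn, old_version_fns):
    """Match a function against the most similar function in the previous
    version -- the core question origin analysis answers."""
    return max(old_version_fns, key=lambda old_fn: similarity(new_fn, old_fn))
```

In an interactive tool, the weights tuple is exactly what the user would tweak between runs, which is why the incremental, as-needed design matters: re-scoring a handful of candidate functions is nearly instantaneous, while re-running a batch analysis is not.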
Comparison of Algorithms

The algorithms all strike compromises between providing more accurate results and running in a usable amount of time. To get better results, there are two main approaches – use a more powerful analysis (AST or slicing) or build heuristics that filter specific cases of unwanted clones (CCFinder, Dup).

Towards building better heuristics, CCFinder's creators have spent time evaluating transformation rules on real systems, including the JDK, Linux, NetBSD, and FreeBSD. This allows them to evaluate their transformation rules and understand which rules are necessary to work well on real systems. The improvements in their accuracy and performance will likely come through this empirical study of exactly how clones change rather than through solving hard algorithmic problems. In contrast, the AST and slicing approaches are much more algorithm-centric, relying on better program analysis to do the work for them. For them, the challenge that lies ahead is coming up with faster algorithms.

Research Directions

Despite a number of tools over the course of a decade and several recent empirical studies of code clones, there is still no solid definition of what constitutes a clone. This is because clones inherently carry with them some values about engineering tradeoffs – are these duplicated pieces of code potentially worthwhile to factor out and remove? A better definition would allow the creation of benchmarks with which to make more meaningful comparisons of an approach's ability to find not just classes of clones but individual clones. Yet today, even people cannot agree on what does or does not constitute a clone (Walenstein et al., 2003). A definition of code clones will probably need to entail some description and understanding of when code clones are significant enough to potentially be worth refactoring. This definition needs to be sufficiently formal that the clone detection tools can use it to filter clone candidates.

While related, there is also a need for a better understanding of what developers are likely to change right after performing a copy and paste. Attempts to build heuristics will need to consider not just what is easy to detect but what types of changes developers actually make. This type of information could be gathered relatively easily by logging everything a developer does and examining the data around the time of copies and pastes. Having this data would lead to better benchmarks of what needs to be prioritized.

One area that none of the tools, other than perhaps slicing, has gotten close to is detecting reimplementation clones. Syntactic or likely even AST-based techniques rely on information that is probably too low level to catch clones whose only common source is an abstract algorithm copied from a textbook. On the other hand, because the tools being used to count duplication do not detect this, it is not even clear that this is a significant problem worth solving. An empirical study examining what types of reimplementation clones exist in a system, how frequent they are, and how they could best be detected would probably be the most useful way to proceed.

Clones also have the potential for a tie-in with the movement towards making recipes and protocols more explicit in the design. A code clone could be considered a type of recipe for solving some problem. Attempts to document common recipes for solving important tasks might be extended to encompass any task that is repetitive enough to require a developer to use copy and paste to implement it. Linking the clones to the recipe would also remove much of their harm to the changeability of code.

Existing tools seem to presume a reengineering or perfective maintenance scenario in generally being batch oriented. Developers are forced to wade through a separate matrix of potential clones and have to launch and wait for the tool just to see any results. Seeing clone links in the left Eclipse editor bar would probably be a much nicer interaction, reminding developers of clones when they are working with the code in question. This would allow them to immediately take the clone information into account when beginning to consider any change to one clone or the possibility of building a new abstraction.

Finally, one simple way of much more reliably detecting copy-and-paste code clones would be to log copy and paste itself. Each clone could receive some XML comment or annotation that contains a unique identifier for the clone class or a listing of links to the other instances of the clone. Such a solution would not be as valuable for a really messy legacy system but would help contain the problem of clones.

Conclusions

Code clones are