CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code

A code clone is a code portion in source files that is identical or similar to another. Since code clones are believed to reduce the maintainability of software, several code clone detection techniques and tools have been proposed. This paper proposes a new clone detection technique, which consists of the transformation of input source text and a token-by-token comparison. For its implementation with several useful optimization techniques, we have developed a tool, named CCFinder (Code Clone Finder), which extracts code clones in C, C++, Java, COBOL and other source files. In addition, metrics for the code clones have been developed. In order to evaluate the usefulness of CCFinder and metrics, we conducted several case studies where we applied the new tool to the source code of JDK, FreeBSD, NetBSD, Linux, and many other systems. As a result, CCFinder has effectively found clones and the metrics have been able to effectively identify the characteristics of the systems. In addition, we have compared the proposed technique with other clone detection techniques.

[1]  Brenda S. Baker,et al.  A Program for Identifying Duplicated Code , 1992 .

[2]  J. Howard Johnson,et al.  Identifying redundancy in source code using fingerprints , 1993, CASCON.

[3]  J. Howard Johnson,et al.  Substring matching for clone detection and change tracking , 1994, Proceedings 1994 International Conference on Software Maintenance.

[4]  Brenda S. Baker,et al.  On finding duplication and near-duplication in large software systems , 1995, Proceedings of 2nd Working Conference on Reverse Engineering.

[5]  Ettore Merlo,et al.  Experiment on the automatic detection of function clones in a software system using metrics , 1996, 1996 Proceedings of International Conference on Software Maintenance.

[6]  D. Gusfield Algorithms on Strings, Trees, and Sequences: Introduction to Suffix Trees , 1997 .

[7]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[8]  Ettore Merlo,et al.  Assessing the benefits of incorporating function clone detection in a development process , 1997, 1997 Proceedings International Conference on Software Maintenance.

[9]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[10]  Dan Gusfield,et al.  Algorithms on strings , 1997 .

[11]  L. D. Moura,et al.  Clone detection using abstract syntax trees , 1998, Proceedings. International Conference on Software Maintenance (Cat. No. 98CB36272).

[12]  James M. Bieman,et al.  Using design abstractions to visualize, quantify, and restructure software , 1998, J. Syst. Softw..

[13]  Stéphane Ducasse,et al.  A language independent approach for detecting duplicated code , 1999, Proceedings IEEE International Conference on Software Maintenance - 1999 (ICSM'99). 'Software Maintenance for Business Change' (Cat. No.99CB36360).

[14]  Magdalena Balazinska,et al.  Partial redesign of Java software systems based on clone analysis , 1999, Sixth Working Conference on Reverse Engineering (Cat. No.PR00303).

[15]  Magdalena Balazinska,et al.  Measuring clone based reengineering opportunities , 1999, Proceedings Sixth International Software Metrics Symposium (Cat. No.PR00403).

[16]  Akito Monden,et al.  Software quality analysis by code clones in industrial legacy software , 2002, Proceedings Eighth IEEE Symposium on Software Metrics.

[17]  Katsuro Inoue,et al.  The Detection of Fault-Prone Program Using a Neural Network , 2003 .

[18]  Renato De Mori,et al.  Pattern matching for clone and concept detection , 2004, Automated Software Engineering.