Evaluating clone detection tools with BigCloneBench

Many clone detection tools have been proposed in the literature. However, our knowledge of their performance in real software systems is limited, particularly their recall. In this paper, we use our big data clone benchmark, BigCloneBench, to evaluate the recall of ten clone detection tools. BigCloneBench is a collection of eight million validated clones within IJaDataset-2.0, a big data software repository containing 25,000 open-source Java systems. BigCloneBench contains both intra-project and inter-project clones of the four primary clone types. We use this benchmark to evaluate the recall of the tools per clone type and across the entire range of clone syntactical similarity. We evaluate the tools for both single-system and cross-project detection scenarios. Using multiple clone-matching metrics, we evaluate the quality of the tools' reporting of the benchmark clones with respect to refactoring and automatic clone analysis use-cases. We compare these real-world results against our Mutation and Injection Framework, a synthetic benchmark, to reveal deeper understanding of the tools. We found that the tools have strong recall for Type-1 and Type-2 clones, as well as Type-3 clones with high syntactical similarity. The tools have weaker detection of clones with lower syntactical similarity.

[1]  Mark Harman,et al.  Searching for better configurations: a rigorous approach to clone evaluation , 2013, ESEC/FSE 2013.

[2]  Warren Toomey,et al.  Ctcompare: Code clone detection using hashed token sequences , 2012, 2012 6th International Workshop on Software Clones (IWSC).

[3]  Shinji Kusumoto,et al.  A dataset of clone references with gaps , 2014, MSR 2014.

[4]  Chanchal Kumar Roy,et al.  NICAD: Accurate Detection of Near-Miss Intentional Clones Using Flexible Pretty-Printing and Code Normalization , 2008, 2008 16th IEEE International Conference on Program Comprehension.

[5]  Chanchal Kumar Roy,et al.  Towards a mutation-based automatic framework for evaluating code clone detection tools , 2008, C3S2E '08.

[6]  Chanchal Kumar Roy,et al.  Evaluating Modern Clone Detection Tools , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[7]  Chanchal Kumar Roy,et al.  A Mutation/Injection-Based Automatic Framework for Evaluating Code Clone Detection Tools , 2009, 2009 International Conference on Software Testing, Verification, and Validation Workshops.

[8]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[9]  Chanchal Kumar Roy,et al.  Towards a Big Data Curated Benchmark of Inter-project Code Clones , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[10]  Ying Zou,et al.  Threshold-free code clone detection for a large-scale heterogeneous Java repository , 2015, 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[11]  Stéphane Ducasse,et al.  A language independent approach for detecting duplicated code , 1999, Proceedings IEEE International Conference on Software Maintenance - 1999 (ICSM'99). 'Software Maintenance for Business Change' (Cat. No.99CB36360).

[12]  Rainer Koschke,et al.  Incremental Clone Detection , 2009, 2009 13th European Conference on Software Maintenance and Reengineering.

[13]  Brenda S. Baker,et al.  Finding Clones with Dup: Analysis of an Experiment , 2007, IEEE Transactions on Software Engineering.

[14]  Rainer Koschke,et al.  Empirical evaluation of clone detection using syntax suffix trees , 2008, Empirical Software Engineering.

[15]  Mark Harman,et al.  An Analysis and Survey of the Development of Mutation Testing , 2011, IEEE Transactions on Software Engineering.

[16]  Chanchal Kumar Roy,et al.  SimCad: An extensible and faster clone detection tool for large scale software systems , 2013, 2013 21st International Conference on Program Comprehension (ICPC).

[17]  Wei Le,et al.  A code clone oracle , 2014, MSR 2014.

[18]  Elmar Jürgens,et al.  CloneDetective - A workbench for clone detection research , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[19]  Zhendong Su,et al.  DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones , 2007, 29th International Conference on Software Engineering (ICSE'07).

[20]  Chanchal K. Roy,et al.  A Survey on Software Clone Detection Research , 2007 .

[21]  Chanchal Kumar Roy,et al.  A mutation analysis based benchmarking framework for clone detectors , 2013, 2013 7th International Workshop on Software Clones (IWSC).

[22]  Chanchal Kumar Roy,et al.  An Empirical Study of Function Clones in Open Source Software , 2008, 2008 15th Working Conference on Reverse Engineering.

[23]  Chanchal Kumar Roy,et al.  Comparison and evaluation of code clone detection techniques and tools: A qualitative approach , 2009, Sci. Comput. Program..

[24]  Maninder Singh,et al.  Software clone detection: A systematic review , 2013, Inf. Softw. Technol..

[25]  Giuliano Antoniol,et al.  Comparison and Evaluation of Clone Detection Tools , 2007, IEEE Transactions on Software Engineering.

[26]  Elizabeth Burd,et al.  Evaluating clone detection tools for use during preventative maintenance , 2002, Proceedings. Second IEEE International Workshop on Source Code Analysis and Manipulation.