SourcererCC and SourcererCC-I: Tools to Detect Clones in Batch Mode and during Software Development

Given the availability of large source-code repositories, there has been a large number of applications for large-scale clone detection. Unfortunately, despite a decade of active research, there is a marked lack in clone detectors that scale to big software systems or large repositories, specifically for detecting near-miss (Type 3) clones where significant editing activities may take place in the cloned code. This paper demonstrates: (i) SourcererCC, a token-based clone detector that targets the first three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. It uses an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect clones, as well as the number of required token-comparisons needed to judge a potential clone; and (ii) SourcererCC-I, an Eclipse plug-in, that uses SourcererCC's core engine to identify and navigate clones (both inter and intra project) in real-time during software development. In our experiments, comparing SourcererCC with the state-of-the-art tools we found that it is the only clone detection tool to successfully scale to 250 MLOC on a standard workstation with 12 GB RAM and efficiently detect the first three types of clones (precision 86% and recall 86-100%). Link to the demo: \url{https://youtu.be/l7F_9Qp-ks4}

[1]  Chanchal Kumar Roy,et al.  Scaling classical clone detection tools for ultra-large datasets: An exploratory study , 2013, 2013 7th International Workshop on Software Clones (IWSC).

[2]  Hajimu Iida,et al.  SHINOBI: A Tool for Automatic Code Clone Detection in the IDE , 2009, 2009 16th Working Conference on Reverse Engineering.

[3]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[4]  Iman Keivanloo,et al.  Internet-scale Real-time Code Clone Search Via Multi-level Indexing , 2011, 2011 18th Working Conference on Reverse Engineering.

[5]  Ferosh Jacob,et al.  CnP: Towards an environment for the proactive management of copy-and-paste programming , 2009, 2009 IEEE 17th International Conference on Program Comprehension.

[6]  Ettore Merlo,et al.  Assessing the benefits of incorporating function clone detection in a development process , 1997, 1997 Proceedings International Conference on Software Maintenance.

[7]  Rainer Koschke,et al.  Incremental Clone Detection , 2009, 2009 13th European Conference on Software Maintenance and Reengineering.

[8]  Rainer Koschke,et al.  Reverse Engineering Variability in Source Code Using Clone Detection: A Case Study for Linux Variants of Consumer Electronic Devices , 2012, 2012 19th Working Conference on Reverse Engineering.

[9]  Chanchal Kumar Roy,et al.  Near-miss function clones in open source software : an empirical study , 2009 .

[10]  Yuanyuan Zhou,et al.  CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code , 2004, OSDI.

[11]  Peng Liu,et al.  Achieving accuracy and scalability simultaneously in detecting application clones on Android markets , 2014, ICSE.

[12]  Shinji Kusumoto,et al.  Inter-Project Functional Clone Detection Toward Building Libraries - An Empirical Study on 13,000 Projects , 2012, 2012 19th Working Conference on Reverse Engineering.

[13]  Brenda S. Baker,et al.  On finding duplication and near-duplication in large software systems , 1995, Proceedings of 2nd Working Conference on Reverse Engineering.

[14]  Katsuro Inoue,et al.  Very-Large Scale Code Clone Analysis and Visualization of Open Source Programs Using Distributed CCFinder: D-CCFinder , 2007, 29th International Conference on Software Engineering (ICSE'07).

[15]  Chanchal Kumar Roy,et al.  Comparison and evaluation of code clone detection techniques and tools: A qualitative approach , 2009, Sci. Comput. Program..

[16]  Elmar Jürgens,et al.  Index-based code clone detection: incremental, distributed, scalable , 2010, 2010 IEEE International Conference on Software Maintenance.

[17]  Chanchal K. Roy,et al.  A Survey on Software Clone Detection Research , 2007 .

[18]  Hitesh Sajnani,et al.  A parallel and efficient approach to large scale clone detection , 2013, IWSC 2013.

[19]  Martin P. Robillard,et al.  Clonetracker: tool support for code clone management , 2008, ICSE '08.

[20]  Cristina V. Lopes,et al.  SourcererCC: Scaling Code Clone Detection to Big-Code , 2015, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[21]  Rainer Koschke Large-Scale Inter-System Clone Detection Using Suffix Trees , 2012, 2012 16th European Conference on Software Maintenance and Reengineering.

[22]  Chanchal Kumar Roy,et al.  NICAD: Accurate Detection of Near-Miss Intentional Clones Using Flexible Pretty-Printing and Code Normalization , 2008, 2008 16th IEEE International Conference on Program Comprehension.

[23]  Michael W. Godfrey,et al.  Software bertillonage: finding the provenance of an entity , 2011, MSR '11.