CCSharp: An Efficient Three-Phase Code Clone Detector Using Modified PDGs

Detecting code clones in software systems is becoming increasingly important as open source projects flourish. Despite extensive active research, there is still no efficient and accurate way to detect clones, especially high-level clones. In this paper, we present CCSharp, a three-phase clone detector based on program dependence graphs (PDGs) that can detect many more clones, in addition to high-level ones, in software systems. To address the high time cost of PDG-based tools, we adopt two strategies to reduce the overall computation of our tool: modification of the PDG structure and characteristic vector filtering. For the PDG structure modification, we propose a novel technique for merging procedure invocation nodes, which makes clone detection insensitive to procedure parameters and disguise and also reduces the size of the PDG. We run clone detection on both real-world and artificial codebases with CCSharp and three other state-of-the-art tools. Experimental results show that CCSharp achieves both high recall and high precision, and detects many more unique clones than the other three tools.
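
The characteristic vector filtering idea can be illustrated with a minimal sketch. It is not CCSharp's implementation: the abstract does not specify the vector features, the distance measure, or the threshold, so the node kinds in `NODE_KINDS`, the L1 distance, and `max_distance` below are all assumptions chosen for illustration. The sketch only shows the general pattern of using a cheap vector comparison to discard dissimilar PDG pairs before any expensive graph matching is attempted.

```python
from collections import Counter
from itertools import combinations

# Hypothetical node kinds; the features CCSharp actually counts are not
# stated in the abstract.
NODE_KINDS = ("assign", "call", "branch", "loop", "return")


def characteristic_vector(node_kinds):
    """Count how many nodes of each kind a PDG contains."""
    counts = Counter(node_kinds)
    return tuple(counts[kind] for kind in NODE_KINDS)


def l1_distance(u, v):
    """Sum of absolute component differences (an assumed distance measure)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def candidate_pairs(pdgs, max_distance=1):
    """Return the PDG pairs whose vectors are close enough to be worth
    passing on to a costly graph-matching step; all other pairs are skipped."""
    vectors = {name: characteristic_vector(kinds) for name, kinds in pdgs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(vectors), 2)
        if l1_distance(vectors[a], vectors[b]) <= max_distance
    ]


# Toy input: each PDG is reduced to the list of its node kinds.
pdgs = {
    "f": ["assign", "call", "branch", "return"],
    "g": ["assign", "call", "branch", "assign", "return"],
    "h": ["loop", "loop", "call", "call", "call", "return"],
}

print(candidate_pairs(pdgs))
```

In this toy input only the pair `f` and `g` survives the filter, so a graph-matching phase would run once instead of three times.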
