A Survey of Software Clone Detection Techniques

Two fragments of source code that are identical or similar to each other are called code clones. Code clones complicate software maintenance and propagate bugs. They arise for several reasons, such as code reuse by copying pre-existing fragments, coding style, and repeated computation using duplicated functions with slight changes in the variables or data structures used. When a code fragment is edited, all related clones have to be checked to see whether they need to be modified as well. Removal, avoidance, and refactoring of cloned code are other important issues in software maintenance. However, several research studies have demonstrated that removing or refactoring cloned code is sometimes harmful. This study discusses code clones, common types of clones, the phases of clone detection, the state of the art in code clone detection techniques and tools, and the challenges faced by clone detection techniques.
