Open-source tools and benchmarks for code-clone detection: past, present, and future trends

A code-clone is a fragment of source code that is identical or similar to another. Code-clones make applications difficult to maintain because they create multiple points in the code at which bugs must be fixed, new rules enforced, or design decisions imposed. As applications grow larger, code-clones become more pervasive. To address code-clone-related issues, many tools and algorithms have been proposed to find and document code-clones within an application. In this paper, we present the historical trends in code-clone detection tools to show how we arrived at the current implementations. We then present the results of a systematic mapping study on current (2009-2019) code-clone detection tools with regard to technique, open-source nature, and language coverage. Lastly, we propose future directions for code-clone detection tools. This paper provides the essentials for understanding the code-clone detection process and the current state-of-the-art solutions.
