How are functionally similar code clones syntactically different? An empirical study and a benchmark

Background. Today, redundancy in source code, so-called ‘‘clones’’ caused by copy&paste can be found reliably using clone detection tools. Redundancy can arise also independently, however, not caused by copy&paste. At present, it is not clear how only functionally similar clones (FSC) differ from clones created by copy&paste. Our aim is to understand and categorise the syntactical differences in FSCs that distinguish them from copy&paste clones in a way that helps clone detection research. Methods. We conducted an experiment using known functionally similar programs in Java and C from coding contests. We analysed syntactic similarity with traditional detection tools and explored whether concolic clone detection can go beyond syntax. We ran all tools on 2,800 programs and manually categorised the differences in a random sample of 70 program pairs. Results. We found no FSCs where complete files were syntactically similar. We could detect a syntactic similarity in a part of the files in <16% of the program pairs. Concolic detection found 1 of the FSCs. The differences between program pairs were in the categories algorithm, data structure, OO design, I/O and libraries. We selected 58 pairs for an openly accessible benchmark representing these categories. Discussion. The majority of differences between functionally similar clones are beyond the capabilities of current clone detection approaches. Yet, our benchmark can help to drive further clone detection research.

[1]  Bernhard Schätz,et al.  Can clone detection support quality assessments of requirements specifications? , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[2]  Chanchal Kumar Roy,et al.  Comparison and evaluation of code clone detection techniques and tools: A qualitative approach , 2009, Sci. Comput. Program..

[3]  Maninder Singh,et al.  Software clone detection: A systematic review , 2013, Inf. Softw. Technol..

[4]  Emad Shihab,et al.  CCCD: Concolic code clone detection , 2013, 2013 20th Working Conference on Reverse Engineering (WCRE).

[5]  Giuliano Antoniol,et al.  Comparison and Evaluation of Clone Detection Tools , 2007, IEEE Transactions on Software Engineering.

[6]  R. Yin Case Study Research: Design and Methods , 1984 .

[7]  Stefan Wagner,et al.  Software Product Quality Control , 2013, Springer Berlin Heidelberg.

[8]  Susan Horwitz,et al.  Using Slicing to Identify Duplication in Source Code , 2001, SAS.

[9]  Stefan Wagner,et al.  Challenges of the Dynamic Detection of Functionally Similar Code Fragments , 2012, 2012 16th European Conference on Software Maintenance and Reengineering.

[10]  Zhendong Su,et al.  Automatic mining of functionally equivalent code fragments via random testing , 2009, ISSTA.

[11]  Stefan Wagner,et al.  Detection of Functionally Similar Code Clones: Data, Analysis Software, Benchmark , 2014 .

[12]  Rainer Koschke,et al.  Survey of Research on Software Clones , 2006, Duplication, Redundancy, and Similarity in Software.

[13]  J. R. Landis,et al.  The measurement of observer agreement for categorical data. , 1977, Biometrics.

[14]  Chanchal Kumar Roy,et al.  Towards a Big Data Curated Benchmark of Inter-project Code Clones , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[15]  Elmar Jürgens,et al.  Tool Support for Continuous Quality Control , 2008, IEEE Software.

[16]  Zhendong Su,et al.  Context-based detection of clone-related bugs , 2007, ESEC-FSE '07.

[17]  Zhendong Su,et al.  DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones , 2007, 29th International Conference on Software Engineering (ICSE'07).

[18]  Elmar Jürgens,et al.  Do code clones matter? , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[19]  Chanchal K. Roy,et al.  A Survey on Software Clone Detection Research , 2007 .

[20]  Heejung Kim,et al.  MeCC: memory comparison-based clone detector , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[21]  Christine Nadel,et al.  Case Study Research Design And Methods , 2016 .

[22]  Toshihiro Kamiya,et al.  Agec: An execution-semantic clone detection tool , 2013, 2013 21st International Conference on Program Comprehension (ICPC).

[23]  Claes Wohlin,et al.  Experimentation in Software Engineering , 2012, Springer Berlin Heidelberg.

[24]  Ewan D. Tempero,et al.  Towards a curated collection of code clones , 2013, 2013 7th International Workshop on Software Clones (IWSC).

[25]  Jens Krinke,et al.  Identifying similar code with program dependence graphs , 2001, Proceedings Eighth Working Conference on Reverse Engineering.

[26]  Rainer Koschke,et al.  An extended assessment of type-3 clones as detected by state-of-the-art tools , 2011, Software Quality Journal.

[27]  Andrian Marcus,et al.  Identification of high-level concept clones in source code , 2001, Proceedings 16th Annual International Conference on Automated Software Engineering (ASE 2001).

[28]  Elmar Jürgens,et al.  Code Similarities Beyond Copy & Paste , 2010, 2010 14th European Conference on Software Maintenance and Reengineering.

[29]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[30]  Dietmar Pfahl,et al.  Reporting guidelines for controlled experiments in software engineering , 2005, 2005 International Symposium on Empirical Software Engineering, 2005..

[31]  Zhendong Su,et al.  Scalable detection of semantic clones , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[32]  Yun Yang,et al.  Towards a clone detection benchmark suite and results archive , 2003, 11th IEEE International Workshop on Program Comprehension, 2003..

[33]  Shinji Kusumoto,et al.  Incremental Code Clone Detection: A PDG-based Approach , 2011, 2011 18th Working Conference on Reverse Engineering.

[34]  Chanchal Kumar Roy,et al.  Evaluating Modern Clone Detection Tools , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.