Using compilation/decompilation to enhance clone detection

We study the effects of compilation and decompilation on code clone detection in Java. Compilation followed by decompilation canonicalises syntactic changes made to source code and can therefore serve as a form of source code normalisation. We used NiCad to detect clones before and after decompilation in three open source software systems: JUnit, JFreeChart, and Tomcat. We filtered and compared the clones in the original and decompiled clone sets and found that 1,201 clone pairs (78.7%) are common to the two sets, while 326 pairs (21.3%) appear in only one of them. A manual investigation identified 325 of the 326 pairs as true clones. The 252 original-only clone pairs contain a single false positive, while the 74 decompiled-only clone pairs are all true positives. Many of the clones in the original source code that are detected only after decompilation are type-3 clones that are difficult to detect because of added or deleted statements, keywords, or package names; flipped if-else statements; or changed loops. We suggest using decompilation as a normalisation step to complement clone detection. By combining the clones found before and after decompilation, one can achieve higher recall without losing precision.
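The comparison and combination of the two clone sets can be illustrated with plain set operations. The following is a minimal sketch, not the tooling used in the study: it assumes each clone pair has already been mapped back to a pair of original source fragments, and the names `make_pair`, `compare_clone_sets`, and `share`, as well as the fragment identifiers, are hypothetical. The last lines only re-check the arithmetic of the figures reported above.

```python
from typing import Dict, FrozenSet, Set

# A clone pair is an unordered pair of fragment identifiers.
# The identifier format ("File.java:start-end") is an assumption.
ClonePair = FrozenSet[str]


def make_pair(a: str, b: str) -> ClonePair:
    """Build an order-independent clone pair."""
    return frozenset((a, b))


def compare_clone_sets(
    original: Set[ClonePair], decompiled: Set[ClonePair]
) -> Dict[str, Set[ClonePair]]:
    """Split two clone sets into common, original-only, and
    decompiled-only pairs, and form their union."""
    return {
        "common": original & decompiled,
        "original_only": original - decompiled,
        "decompiled_only": decompiled - original,
        "combined": original | decompiled,
    }


def share(part: int, total: int) -> float:
    """Percentage of `part` in `total`, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


# Toy input with hypothetical fragment identifiers.
original = {
    make_pair("A.java:10-20", "B.java:5-15"),
    make_pair("C.java:1-9", "D.java:30-38"),
}
decompiled = {
    make_pair("B.java:5-15", "A.java:10-20"),  # same pair, other order
    make_pair("E.java:2-12", "F.java:40-50"),
}
result = compare_clone_sets(original, decompiled)
print({name: len(pairs) for name, pairs in result.items()})

# Arithmetic check on the reported counts.
common, original_only, decompiled_only = 1201, 252, 74
total = common + original_only + decompiled_only
print(total, share(common, total), share(original_only + decompiled_only, total))
```

Taking the union is what raises recall, since pairs found by only one of the two runs are kept. Precision is preserved only to the extent that the extra pairs are true clones, as the manual investigation found for 325 of the 326 pairs.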
