CLCMiner: Detecting Cross-Language Clones without Intermediates

SUMMARY The proliferation of diverse programming languages and platforms makes it common to implement the same functionality in different languages for different platforms, such as Java for Android applications and C# for Windows Phone applications. Although versions of code written in different languages appear syntactically quite different from each other, they are intended to implement the same software and typically contain many code snippets that implement similar functionality, which we call cross-language clones. When the version of code in one language evolves in response to changing functionality requirements and/or bug fixes, its cross-language clones may also need to be changed to keep the implementations of the same functionality consistent. Automated ways to locate and track cross-language clones within evolving software are therefore needed. In the literature, approaches for detecting cross-language clones work only for languages that share a common intermediate language (such as the .NET language family) because they are built on techniques for detecting single-language clones. To extend cross-language clone detection to more diverse languages, we propose a novel automated approach, CLCMiner, that does not need an intermediate language. It mines such clones from revision histories, based on our assumption that revisions to different versions of code implemented in different languages may naturally reflect how programmers change cross-language clones in practice, and that similarities among the revisions (referred to as clones in diffs or diff clones) may indicate actual similar code. We have implemented a prototype and applied it to ten open source projects that have implementations in both Java and C#. The reported clones that occur in revision histories have high precision (89% on average) and recall (95% on average). Compared with token-based code clone detection tools that can treat code as plain text, our tool detects significantly more cross-language clones. The evaluation results demonstrate the feasibility of revision-history-based techniques for detecting cross-language clones without intermediates and point to promising future work.
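The idea of comparing revisions across languages can be illustrated with a minimal sketch. This is not CLCMiner's actual algorithm, which the summary does not specify; it only assumes that each revision is available as a unified diff and that two diffs are "similar" when the bags of identifier tokens on their changed lines have a high cosine similarity. The function names (`changed_tokens`, `cosine`, `diff_clone_candidates`) and the 0.5 threshold are hypothetical choices made for illustration.

```python
import re
from collections import Counter
from math import sqrt

# Assumption: identifiers are a rough, language-neutral signal shared by Java and C#.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def changed_tokens(diff_text):
    """Count lower-cased identifier tokens on added/removed lines of a unified diff."""
    bag = Counter()
    for line in diff_text.splitlines():
        # Simplification: skip file headers ('+++', '---') and any line that is
        # not an addition or removal (context lines, '@@' hunk headers).
        if line.startswith(("+++", "---")) or not line.startswith(("+", "-")):
            continue
        for word in TOKEN.findall(line[1:]):
            bag[word.lower()] += 1
    return bag


def cosine(a, b):
    """Cosine similarity between two token bags (0.0 if either bag is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def diff_clone_candidates(java_diffs, csharp_diffs, threshold=0.5):
    """Return (java_id, csharp_id, score) pairs whose diffs look similar.

    Both arguments map a hypothetical revision id to its unified diff text.
    """
    pairs = []
    for jname, jdiff in java_diffs.items():
        jbag = changed_tokens(jdiff)
        for cname, cdiff in csharp_diffs.items():
            score = cosine(jbag, changed_tokens(cdiff))
            if score >= threshold:
                pairs.append((jname, cname, score))
    # Highest-scoring candidate pairs first.
    return sorted(pairs, key=lambda p: -p[2])
```

Lower-casing the tokens is a deliberate choice in this sketch: it lets Java's `size` and C#'s `Count` style differences in capitalization matter less, at the cost of some false matches. A real tool would likely need a stronger similarity measure and filtering of language keywords.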
