CloneCognition: machine learning based code clone validation tool

A code clone is a pair of similar code fragments, within or between software systems. To detect each possible clone pair from a software system while handling the complex code structures, the clone detection tools undergo a lot of generalization of the original source codes. The generalization often results in returning code fragments that are only coincidentally similar and not considered clones by users, and hence requires manual validation of the reported possible clones by users which is often both time-consuming and challenging. In this paper, we propose a machine learning based tool 'CloneCognition' (Open Source Codes: https://github.com/pseudoPixels/CloneCognition ; Video Demonstration: https://www.youtube.com/watch?v=KYQjmdr8rsw) to automate the laborious manual validation process. The tool runs on top of any code clone detection tools to facilitate the clone validation process. The tool shows promising clone classification performance with an accuracy of up to 87.4%. The tool also exhibits significant improvement in the results when compared with state-of-the-art techniques for code clone validation.

[1]  Michael W. Godfrey,et al.  "Cloning Considered Harmful" Considered Harmful , 2006, 2006 13th Working Conference on Reverse Engineering.

[2]  Chanchal Kumar Roy,et al.  An Empirical Study of Function Clones in Open Source Software , 2008, 2008 15th Working Conference on Reverse Engineering.

[3]  Chanchal K. Roy,et al.  A Survey on Software Clone Detection Research , 2007 .

[4]  Rainer Koschke,et al.  Clone Detection Using Abstract Syntax Suffix Trees , 2006, 2006 13th Working Conference on Reverse Engineering.

[5]  Michael W. Godfrey,et al.  Aiding comprehension of cloning through categorization , 2004, Proceedings. 7th International Workshop on Principles of Software Evolution, 2004..

[6]  Barbara G. Ryder,et al.  CCLearner: A Deep Learning-Based Clone Detection Approach , 2017, 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[7]  Chanchal Kumar Roy,et al.  [Research Paper] On the Use of Machine Learning Techniques Towards the Design of Cloud Based Automatic Code Clone Validation Tools , 2018, 2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM).

[8]  Elmar Jürgens,et al.  Do code clones matter? , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[9]  Jeffrey G. Gray,et al.  An information retrieval process to aid in the analysis of code clones , 2009, Empirical Software Engineering.

[10]  J. Howard Johnson,et al.  Visualizing textual redundancy in legacy source , 1994, CASCON.

[11]  Giuliano Antoniol,et al.  Comparison and Evaluation of Clone Detection Tools , 2007, IEEE Transactions on Software Engineering.

[12]  Martin P. Robillard,et al.  Tracking Code Clones in Evolving Software , 2007, 29th International Conference on Software Engineering (ICSE'07).

[13]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[14]  Chanchal Kumar Roy,et al.  A Machine Learning Based Approach for Evaluating Clone Detection Tools for a Generalized and Accurate Precision , 2016, Int. J. Softw. Eng. Knowl. Eng..

[15]  Chanchal Kumar Roy,et al.  NICAD: Accurate Detection of Near-Miss Intentional Clones Using Flexible Pretty-Printing and Code Normalization , 2008, 2008 16th IEEE International Conference on Program Comprehension.

[16]  Shinji Kusumoto,et al.  Classification model for code clones based on machine learning , 2015, Empirical Software Engineering.

[17]  Zhendong Su,et al.  DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones , 2007, 29th International Conference on Software Engineering (ICSE'07).

[18]  Chanchal Kumar Roy,et al.  Towards a Big Data Curated Benchmark of Inter-project Code Clones , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[19]  Jeff Thomas Svajlenko,et al.  Large-Scale Clone Detection and Benchmarking , 2018 .

[20]  Jeffrey G. Gray,et al.  Phoenix-based clone detection using suffix trees , 2006, ACM-SE 44.