CLCDSA: Cross Language Code Clone Detection using Syntactical Features and API Documentation

Software clones are detrimental to software maintenance and evolution, and as a result many clone detectors have been proposed. These tools target clone detection in software applications written in a single programming language. However, a software application may be written in different languages for different platforms to improve its platform compatibility and its adoption by users of those platforms. Cross-language clones (CLCs) introduce additional challenges when maintaining multi-platform applications and would likely go undetected by existing tools. In this paper, we propose CLCDSA, a cross-language clone detector that detects CLCs without extensive processing of the source code and without generating an intermediate representation. The CLCDSA model analyzes syntactic features of source code written in different programming languages to detect CLCs. To support large-scale clone detection, the model uses an action filter based on cross-language API call similarity to discard pairs that are unlikely to be clones. CLCDSA works in two ways: (a) it detects CLCs on the fly by comparing the similarity of features, and (b) it uses a deep neural network based feature-vector learning model to learn the features and detect CLCs. In an early evaluation, the model achieved an average precision, recall, and F-measure of 0.55, 0.86, and 0.64, respectively, for the first phase and 0.61, 0.93, and 0.71, respectively, for the second phase, indicating that CLCDSA outperforms all available models in detecting cross-language clones.
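The filter-then-compare idea behind the on-the-fly phase can be illustrated with a minimal sketch. It is not the CLCDSA implementation: the fragment layout, the `api_actions` and `features` field names, the use of Jaccard and cosine similarity, and both thresholds are assumptions made for illustration. In particular, the sketch assumes that API calls have already been mapped to language-neutral action tokens and that each fragment already has a numeric syntactic feature vector.

```python
from math import sqrt


def jaccard(a, b):
    """Jaccard similarity of two collections of action tokens."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_candidate_clone(frag_a, frag_b, api_threshold=0.3, feature_threshold=0.9):
    """Two-step check: cheap API-action filter first, feature comparison second.

    Both thresholds are hypothetical values chosen for this example.
    """
    # Step 1 (filter): discard pairs whose API actions barely overlap.
    if jaccard(frag_a["api_actions"], frag_b["api_actions"]) < api_threshold:
        return False
    # Step 2 (comparison): compare the syntactic feature vectors.
    return cosine(frag_a["features"], frag_b["features"]) >= feature_threshold


# Hypothetical fragments: feature values are made-up counts, not real metrics.
java_frag = {"api_actions": {"read", "split", "sort"}, "features": [3, 1, 2, 5]}
python_frag = {"api_actions": {"read", "split", "sort", "print"}, "features": [3, 1, 2, 4]}
unrelated = {"api_actions": {"connect", "send"}, "features": [0, 7, 1, 0]}

print(is_candidate_clone(java_frag, python_frag))  # True
print(is_candidate_clone(java_frag, unrelated))    # False
```

Running the cheap set-overlap filter first means the more expensive feature comparison is applied only to pairs that share API behaviour, which is the motivation the abstract gives for filtering in large-scale detection.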
