Java Code Clone Detection by Exploiting Semantic and Syntax Information From Intermediate Code-Based Graph

Code clone detection plays a critical role in software engineering. Finding "functional" code clones by hand requires rich development experience, which puts novice developers at a disadvantage. Although many approaches have been proposed to detect code clones automatically, their results are not satisfactory. A major reason is that it is difficult to extract syntactic and semantic information from source code. To address this problem, this article develops a novel graph representation approach based on intermediate code for detecting functional code clones. The graph representation is built from the intermediate code compiled from the source code. With it, graph embedding techniques can readily extract syntactic and semantic features from the abstract syntax tree (AST), control flow graph (CFG), and data flow graph (DFG) generated from the intermediate code. A Softmax classifier then detects functional code clone pairs. To improve performance, the embedded representation of the intermediate code is initialized with vectors pretrained on a previously collected LLVM IR dataset. We evaluate the proposed approach on the code clone detection task using the BigCloneBench dataset. The experimental results show that it performs better than existing functional code clone detection approaches. In particular, for type-4 code clone detection, our approach outperforms the baseline approaches by an average of 33.49% in terms of F1 score.
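To make the notion of a functional (type-4) clone concrete, the following minimal Java sketch shows two methods that compute the same result with different syntax and control flow. The class and method names are hypothetical and are not taken from the article or from BigCloneBench; the pair only illustrates the kind of input a functional clone detector is expected to flag.

```java
public class FunctionalCloneExample {

    // Iterative variant: accumulates 1 + 2 + ... + n in a loop.
    public static int sumIterative(int n) {
        int total = 0;
        for (int i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    // Recursive variant: same behavior, no loop and no accumulator,
    // so token-level and tree-level comparison of the source sees little overlap.
    public static int sumRecursive(int n) {
        return n <= 0 ? 0 : n + sumRecursive(n - 1);
    }

    public static void main(String[] args) {
        // Both variants agree on every input tried here.
        for (int n = 0; n <= 10; n++) {
            System.out.println(n + ": " + (sumIterative(n) == sumRecursive(n)));
        }
    }
}
```

The two methods share almost no tokens or tree structure at the source level, which is why purely lexical or syntactic comparison tends to miss such pairs.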
