Functional code clone detection with syntax and semantics fusion learning

Code clone detection is one of the most fundamental techniques in software engineering. Despite intensive research over the past decade, existing techniques remain unsatisfactory at detecting "functional" code clones. In particular, they cannot efficiently extract syntactic and semantic information from source code. In this paper, we propose a novel joint code representation that applies fusion embedding techniques to learn hidden syntactic and semantic features of source code. In addition, we introduce a new granularity for functional code clone detection: our approach treats a set of methods connected by caller-callee relationships as one functionality, and a method with no caller-callee relationship to any other method as a single functionality on its own. We then train a supervised deep learning model to detect functional code clones. We evaluate the approach on a large dataset of C++ programs, and the experimental results show that fusion learning significantly outperforms state-of-the-art techniques in detecting functional code clones.
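The proposed granularity can be illustrated with a minimal sketch. It assumes the call relationships are already available as (caller, callee) pairs and treats them as undirected edges, so that each connected component of the call graph forms one functionality. The function name `group_functionalities`, the input format, and the example method names are hypothetical and not taken from the paper, which does not describe its implementation here.

```python
from collections import defaultdict


def group_functionalities(methods, calls):
    """Group methods into functionalities.

    Assumption: a functionality is a connected component of the call
    graph when caller-callee edges are treated as undirected. A method
    with no edges forms a functionality by itself.
    """
    # Union-find parent table: every method starts in its own group.
    parent = {m: m for m in methods}

    def find(m):
        # Follow parent links to the group representative,
        # shortening the path along the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for caller, callee in calls:
        # Skip calls to methods outside the analysed set (e.g. library calls).
        if caller in parent and callee in parent:
            parent[find(caller)] = find(callee)

    groups = defaultdict(set)
    for m in methods:
        groups[find(m)].add(m)

    # Sort for a deterministic result.
    return sorted(sorted(g) for g in groups.values())


# Hypothetical example: main -> parse -> tokenize, and an isolated log method.
methods = ["main", "parse", "tokenize", "log"]
calls = [("main", "parse"), ("parse", "tokenize")]
functionalities = group_functionalities(methods, calls)
print(functionalities)
```

In this example, `main`, `parse`, and `tokenize` end up in one functionality, while `log` is a functionality on its own. Each resulting group would then be the unit that is embedded and compared for clone detection.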
