Deep learning code fragments for code clone detection

Code clone detection is an important problem for software maintenance and evolution. Many approaches consider either structure or identifiers, but none of the existing detection techniques model both sources of information. These techniques also depend on generic, handcrafted features to represent code fragments. We introduce learning-based detection techniques where everything for representing terms and fragments in source code is mined from the repository. Our code analysis supports a framework, which relies on deep learning, for automatically linking patterns mined at the lexical level with patterns mined at the syntactic level. We evaluated our novel learning-based approach for code clone detection with respect to feasibility from the point of view of software maintainers. We sampled and manually evaluated 398 file- and 480 method-level pairs across eight real-world Java systems; 93% of the file- and method-level samples were evaluated to be true positives. Among the true positives, we found pairs mapping to all four clone types. We compared our approach to a traditional structure-oriented technique and found that our learning-based approach detected clones that were either undetected or suboptimally reported by the prominent tool Deckard. Our results affirm that our learning-based approach is suitable for clone detection and a tenable technique for researchers.

[1]  Brenda S. Baker Parameterized Pattern Matching: Algorithms and Applications , 1996, J. Comput. Syst. Sci..

[2]  Jean-Luc Gauvain,et al.  Training Neural Network Language Models on Very Large Corpora , 2005, HLT.

[3]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[4]  Susan Horwitz,et al.  Using Slicing to Identify Duplication in Source Code , 2001, SAS.

[5]  Yoshua Bengio,et al.  Hierarchical Probabilistic Neural Network Language Model , 2005, AISTATS.

[6]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[7]  Derek C. Rose,et al.  Deep Machine Learning - A New Frontier in Artificial Intelligence Research [Research Frontier] , 2010, IEEE Computational Intelligence Magazine.

[8]  Chanchal Kumar Roy,et al.  Towards a Big Data Curated Benchmark of Inter-project Code Clones , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[9]  Geoffrey E. Hinton,et al.  Semantic hashing , 2009, Int. J. Approx. Reason..

[10]  Paolo Tonella,et al.  Nomen est omen: analyzing the language of function identifiers , 1999, Sixth Working Conference on Reverse Engineering (Cat. No.PR00303).

[11]  Lukás Burget,et al.  Extensions of recurrent neural network language model , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  R. Koschke,et al.  Frontiers of software clone management , 2008, 2008 Frontiers of Software Maintenance.

[13]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[14]  Premkumar T. Devanbu,et al.  CACHECA: A Cache Language Model Based Code Suggestion Tool , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[15]  Ronald L. Rivest,et al.  Introduction to Algorithms, third edition , 2009 .

[16]  Yoshua. Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[17]  William W. Cohen,et al.  Natural Language Models for Predicting Programming Comments , 2013, ACL.

[18]  Premkumar T. Devanbu,et al.  Clones: what is that smell? , 2010, 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010).

[19]  L. Sridevi,et al.  Clone Detection Using Abstract Syntax Trees , 2016 .

[20]  Chanchal Kumar Roy,et al.  Comparison and evaluation of code clone detection techniques and tools: A qualitative approach , 2009, Sci. Comput. Program..

[21]  Jens Krinke,et al.  Identifying similar code with program dependence graphs , 2001, Proceedings Eighth Working Conference on Reverse Engineering.

[22]  Shinji Kusumoto,et al.  How Accurate Is Coarse-grained Clone Detection?: Comparision with Fine-grained Detectors , 2014, Electron. Commun. Eur. Assoc. Softw. Sci. Technol..

[23]  Yee Whye Teh,et al.  A fast and simple algorithm for training neural probabilistic language models , 2012, ICML.

[24]  Premkumar T. Devanbu,et al.  On the "naturalness" of buggy code , 2015, ICSE.

[25]  Miryung Kim,et al.  An Empirical Study of Long-Lived Code Clones , 2011, FASE.

[26]  Brenda S. Baker,et al.  A Program for Identifying Duplicated Code , 1992 .

[27]  Christopher Potts,et al.  Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank , 2013, EMNLP.

[28]  Yun Yang,et al.  Problems creating task-relevant clone detection reference data , 2003, 10th Working Conference on Reverse Engineering, 2003. WCRE 2003. Proceedings..

[29]  Per Runeson,et al.  Guidelines for conducting and reporting case study research in software engineering , 2009, Empirical Software Engineering.

[30]  Yoshua Bengio,et al.  Practical Recommendations for Gradient-Based Training of Deep Architectures , 2012, Neural Networks: Tricks of the Trade.

[31]  Christoph Goller,et al.  Learning task-dependent distributed representations by backpropagation through structure , 1996, Proceedings of International Conference on Neural Networks (ICNN'96).

[32]  Paolo Tonella,et al.  Interpolated n-grams for model based testing , 2014, ICSE.

[33]  Romain Robbes,et al.  Language-Independent Clone Detection Applied to Plagiarism Detection , 2010, 2010 10th IEEE Working Conference on Source Code Analysis and Manipulation.

[34]  Yuanyuan Zhou,et al.  CP-Miner: finding copy-paste and related bugs in large-scale software code , 2006, IEEE Transactions on Software Engineering.

[35]  Tomas Mikolov,et al.  RNNLM - Recurrent Neural Network Language Modeling Toolkit , 2011 .

[36]  Zhendong Su,et al.  Automatic mining of functionally equivalent code fragments via random testing , 2009, ISSTA.

[37]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[38]  Cristina V. Lopes,et al.  File cloning in open source Java projects: The good, the bad, and the ugly , 2011, 2011 27th IEEE International Conference on Software Maintenance (ICSM).

[39]  Paul J. Werbos,et al.  Backpropagation Through Time: What It Does and How to Do It , 1990, Proc. IEEE.

[40]  Seung-won Hwang,et al.  Instant code clone search , 2010, FSE '10.

[41]  J. Howard Johnson,et al.  Visualizing textual redundancy in legacy source , 1994, CASCON.

[42]  Hoan Anh Nguyen,et al.  Accurate and Efficient Structural Characteristic Feature Extraction for Clone Detection , 2009, FASE.

[43]  Lu Zhang,et al.  Can I clone this piece of code here? , 2012, 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering.

[44]  Geoffrey E. Hinton,et al.  Using very deep autoencoders for content-based image retrieval , 2011, ESANN.

[45]  Chanchal Kumar Roy,et al.  Evaluating clone detection tools with BigCloneBench , 2015, 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[46]  Miryung Kim,et al.  An empirical study of code clone genealogies , 2005, ESEC/FSE-13.

[47]  Zhendong Su,et al.  Scalable detection of semantic clones , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[48]  Siau-Cheng Khoo,et al.  Scalable detection of missed cross-function refactorings , 2014, ISSTA 2014.

[49]  Razvan Pascanu,et al.  How to Construct Deep Recurrent Neural Networks , 2013, ICLR.

[50]  Zhendong Su,et al.  Context-based detection of clone-related bugs , 2007, ESEC-FSE '07.

[51]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[52]  Dongmei Zhang,et al.  XIAO: tuning code clones at hands of engineers in practice , 2012, ACSAC '12.

[53]  Elizabeth Burd,et al.  Evaluating clone detection tools for use during preventative maintenance , 2002, Proceedings. Second IEEE International Workshop on Source Code Analysis and Manipulation.

[54]  Akito Monden,et al.  Software quality analysis by code clones in industrial legacy software , 2002, Proceedings Eighth IEEE Symposium on Software Metrics.

[55]  Lukás Burget,et al.  Strategies for training large scale neural network language models , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[56]  Benjamin Schrauwen,et al.  Training and Analysing Deep Recurrent Neural Networks , 2013, NIPS.

[57]  David W. Binkley,et al.  What’s in a Name? A Study of Identifiers , 2006, 14th IEEE International Conference on Program Comprehension (ICPC'06).

[58]  J. Howard Johnson,et al.  Identifying redundancy in source code using fingerprints , 1993, CASCON.

[59]  Philip S. Yu,et al.  GPLAG: detection of software plagiarism by program dependence graph analysis , 2006, KDD '06.

[60]  Zhendong Su,et al.  DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones , 2007, 29th International Conference on Software Engineering (ICSE'07).

[61]  Zhendong Su,et al.  A study of the uniqueness of source code , 2010, FSE '10.

[62]  Michael W. Godfrey,et al.  “Cloning considered harmful” considered harmful: patterns of cloning in software , 2008, Empirical Software Engineering.

[63]  Joshua Goodman,et al.  Classes for fast maximum entropy training , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[64]  Matthias Rieger,et al.  Effective Clone Detection Without Language Barriers , 2005 .

[65]  J. Howard Johnson,et al.  Substring matching for clone detection and change tracking , 1994, Proceedings 1994 International Conference on Software Maintenance.

[66]  Miryung Kim,et al.  An ethnographic study of copy and paste programming practices in OOPL , 2004, Proceedings. 2004 International Symposium on Empirical Software Engineering, 2004. ISESE '04..

[67]  Xin-She Yang,et al.  Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.

[68]  Joe D. Warren,et al.  The program dependence graph and its use in optimization , 1987, TOPL.

[69]  Premkumar T. Devanbu,et al.  Will They Like This? Evaluating Code Contributions with Language Models , 2015, 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories.

[70]  Eran Yahav,et al.  Code completion with statistical language models , 2014, PLDI.

[71]  Rainer Koschke,et al.  Survey of Research on Software Clones , 2006, Duplication, Redundancy, and Similarity in Software.

[72]  Katsuro Inoue,et al.  Finding file clones in FreeBSD Ports Collection , 2010, 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010).

[73]  Jeffrey C. Carver,et al.  On the need for human-based empirical validation of techniques and tools for code clone analysis , 2011, IWSC '11.

[74]  Geoffrey E. Hinton,et al.  Generating Text with Recurrent Neural Networks , 2011, ICML.

[75]  Peng Liu,et al.  Achieving accuracy and scalability simultaneously in detecting application clones on Android markets , 2014, ICSE.

[76]  Siau-Cheng Khoo,et al.  Vector abstraction and concretization for scalable detection of refactorings , 2014, FSE 2014.

[77]  Geoffrey E. Hinton,et al.  Three new graphical models for statistical language modelling , 2007, ICML '07.

[78]  Neil Davey,et al.  The development of a software clone detector , 1995 .

[79]  Geoffrey E. Hinton,et al.  On the importance of initialization and momentum in deep learning , 2013, ICML.

[80]  Markus Pizka,et al.  Concise and consistent naming , 2005, 13th International Workshop on Program Comprehension (IWPC'05).

[81]  James H. Martin,et al.  Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd Edition , 2000, Prentice Hall series in artificial intelligence.

[82]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[83]  Mark Harman,et al.  Searching for better configurations: a rigorous approach to clone evaluation , 2013, ESEC/FSE 2013.

[84]  Claes Wohlin,et al.  Experimentation in software engineering: an introduction , 2000 .

[85]  Pascal Vincent,et al.  Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives , 2012, ArXiv.

[86]  Anh Tuan Nguyen,et al.  Lexical statistical machine translation for language migration , 2013, ESEC/FSE 2013.

[87]  Chanchal K. Roy,et al.  A Survey on Software Clone Detection Research , 2007 .

[88]  Rainer Koschke,et al.  Clone Detection Using Abstract Syntax Suffix Trees , 2006, 2006 13th Working Conference on Reverse Engineering.

[89]  Arie van Deursen,et al.  Data clone detection and visualization in spreadsheets , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[90]  Heejung Kim,et al.  MeCC: memory comparison-based clone detector , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[91]  Premkumar T. Devanbu,et al.  On the localness of software , 2014, SIGSOFT FSE.

[92]  Hoan Anh Nguyen,et al.  Clone Management for Evolving Software , 2012, IEEE Transactions on Software Engineering.

[93]  Jeffrey C. Carver,et al.  Claims and beliefs about code clones: Do we agree as a community? A survey , 2012, 2012 6th International Workshop on Software Clones (IWSC).

[94]  Charles A. Sutton,et al.  Suggesting accurate method and class names , 2015, ESEC/SIGSOFT FSE.

[95]  Wuu Yang,et al.  Identifying syntactic differences between two programs , 1991, Softw. Pract. Exp..

[96]  Lukás Burget,et al.  Recurrent neural network based language model , 2010, INTERSPEECH.

[97]  References , 1971 .

[98]  Martin White,et al.  Toward Deep Learning Software Repositories , 2015, 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories.

[99]  Charles A. Sutton,et al.  Learning natural coding conventions , 2014, SIGSOFT FSE.

[100]  José Nelson Amaral,et al.  Syntax errors just aren't natural: improving error reporting with language models , 2014, MSR 2014.

[101]  Hoan Anh Nguyen,et al.  Complete and accurate clone detection in graph-based models , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[102]  Anh Tuan Nguyen,et al.  Statistical learning approach for mining API usage mappings for code migration , 2014, ASE.

[103]  Andrian Marcus,et al.  Identification of high-level concept clones in source code , 2001, Proceedings 16th Annual International Conference on Automated Software Engineering (ASE 2001).

[104]  Jeffrey Pennington,et al.  Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions , 2011, EMNLP.

[105]  Brenda S. Baker,et al.  On finding duplication and near-duplication in large software systems , 1995, Proceedings of 2nd Working Conference on Reverse Engineering.

[106]  Sheeva Afshan,et al.  Evolving Readable String Test Inputs Using a Natural Language Model to Reduce Human Oracle Cost , 2013, 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation.

[107]  Y. Anzai,et al.  Pattern Recognition & Machine Learning , 2016 .

[108]  Alfred V. Aho,et al.  Compilers: Principles, Techniques, and Tools , 1986, Addison-Wesley series in computer science / World student series edition.

[109]  Eric R. Ziegel,et al.  The Elements of Statistical Learning , 2003, Technometrics.

[110]  Stéphane Ducasse,et al.  A language independent approach for detecting duplicated code , 1999, Proceedings IEEE International Conference on Software Maintenance - 1999 (ICSM'99). 'Software Maintenance for Business Change' (Cat. No.99CB36360).

[111]  David Lo,et al.  Active refinement of clone anomaly reports , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[112]  Anh Tuan Nguyen,et al.  Migrating code with statistical machine translation , 2014, ICSE Companion.

[113]  R. Rosenfeld,et al.  Two decades of statistical language modeling: where do we go from here? , 2000, Proceedings of the IEEE.

[114]  Weiqiang Zhang,et al.  RNN language model with word clustering and class-based output layer , 2013, EURASIP J. Audio Speech Music. Process..

[115]  Vysoké Učení,et al.  Statistical Language Models Based on Neural Networks , 2012 .

[116]  Andrew Walenstein,et al.  The Software Similarity Problem in Malware Analysis , 2006, Duplication, Redundancy, and Similarity in Software.

[117]  Markus Pizka,et al.  Concise and Consistent Naming , 2005, IWPC.

[118]  Zhendong Su,et al.  On the naturalness of software , 2012, ICSE 2012.