A Novel Neural Source Code Representation Based on Abstract Syntax Tree

Exploiting machine learning techniques for analyzing programs has attracted much attention. One key problem is how to represent code fragments well for follow-up analysis. Traditional information retrieval based methods often treat programs as natural language texts, which could miss important semantic information of source code. Recently, state-of-the-art studies demonstrate that abstract syntax tree (AST) based neural models can better represent source code. However, the sizes of ASTs are usually large and the existing models are prone to the long-term dependency problem. In this paper, we propose a novel AST-based Neural Network (ASTNN) for source code representation. Unlike existing models that work on entire ASTs, ASTNN splits each large AST into a sequence of small statement trees, and encodes the statement trees to vectors by capturing the lexical and syntactical knowledge of statements. Based on the sequence of statement vectors, a bidirectional RNN model is used to leverage the naturalness of statements and finally produce the vector representation of a code fragment. We have applied our neural network based source code representation method to two common program comprehension tasks: source code classification and code clone detection. Experimental results on the two tasks indicate that our model is superior to state-of-the-art approaches.

[1]  Shane McIntosh,et al.  An Empirical Comparison of Model Validation Techniques for Defect Prediction Models , 2017, IEEE Transactions on Software Engineering.

[2]  Eran Yahav,et al.  Code completion with statistical language models , 2014, PLDI.

[3]  Paulo Borba,et al.  Understanding semi-structured merge conflict characteristics in open-source Java projects , 2017, Empirical Software Engineering.

[4]  David Lo,et al.  An empirical assessment of Bellon's clone benchmark , 2015, EASE.

[5]  Katsuro Inoue,et al.  MUDABlue: An Automatic Categorization System for Open Source Repositories , 2004, APSEC.

[6]  Guy L. Steele,et al.  The Java Language Specification, Java SE 8 Edition , 2013 .

[7]  Chanchal K. Roy,et al.  A Survey on Software Clone Detection Research , 2007 .

[8]  Brad A. Myers,et al.  Studying the language and structure in non-programmers' solutions to programming problems , 2001, Int. J. Hum. Comput. Stud..

[9]  Anh Tuan Nguyen,et al.  Combining Deep Learning with Information Retrieval to Localize Buggy Files for Bug Reports (N) , 2015, 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[10]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[11]  Andrian Marcus,et al.  Supporting program comprehension using semantic and structural information , 2001, Proceedings of the 23rd International Conference on Software Engineering. ICSE 2001.

[12]  Cristina V. Lopes,et al.  SourcererCC: Scaling Code Clone Detection to Big-Code , 2015, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[13]  Hongyu Guo,et al.  Long Short-Term Memory Over Recursive Structures , 2015, ICML.

[14]  Sepp Hochreiter,et al.  The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions , 1998, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[15]  Xiaodong Gu,et al.  Deep API learning , 2016, SIGSOFT FSE.

[16]  Ah Chung Tsoi,et al.  An improved algorithm for learning long-term dependency problems in adaptive processing of data structures , 2003, IEEE Trans. Neural Networks.

[17]  Jian Zhou,et al.  Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[18]  Koushik Sen,et al.  DeepBugs: a learning approach to name-based bug detection , 2018, Proc. ACM Program. Lang..

[19]  Ming Li,et al.  Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code , 2017, IJCAI.

[20]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[21]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[22]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[23]  Joe D. Warren,et al.  The program dependence graph and its use in optimization , 1984, TOPL.

[24]  Chanchal Kumar Roy,et al.  Towards a Big Data Curated Benchmark of Inter-project Code Clones , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[25]  Zhendong Su,et al.  On the naturalness of software , 2012, ICSE 2012.

[26]  Martin White,et al.  Deep learning code fragments for code clone detection , 2016, 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE).

[27]  Lukás Burget,et al.  Recurrent neural network based language model , 2010, INTERSPEECH.

[28]  Premkumar T. Devanbu,et al.  Delft University of Technology On the “ Naturalness ” of Buggy Code , 2017 .

[29]  Gang Zhao,et al.  DeepSim: deep learning code functional similarity , 2018, ESEC/SIGSOFT FSE.

[30]  Song Wang,et al.  Automatically Learning Semantic Features for Defect Prediction , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[31]  Ming Li,et al.  Enhancing the Unified Features to Locate Buggy Files by Exploiting the Sequential Nature of Source Code , 2017, IJCAI.

[32]  Matias Martinez,et al.  Fine-grained and accurate source code differencing , 2014, ASE.

[33]  Wojciech Zaremba,et al.  Learning to Execute , 2014, ArXiv.

[34]  Richard S. Zemel,et al.  Gated Graph Sequence Neural Networks , 2015, ICLR.

[35]  Jian Pei,et al.  Asymmetric Transitivity Preserving Graph Embedding , 2016, KDD.

[36]  Stefanos Gritzalis,et al.  Examining the significance of high-level programming features in source code author classification , 2008, J. Syst. Softw..

[37]  Atul Prakash,et al.  A Framework for Source Code Search Using Program Patterns , 1994, IEEE Trans. Software Eng..

[38]  Premkumar T. Devanbu,et al.  A Survey of Machine Learning for Big Code and Naturalness , 2017, ACM Comput. Surv..

[39]  Charles A. Sutton,et al.  Suggesting accurate method and class names , 2015, ESEC/SIGSOFT FSE.

[40]  Andrea De Lucia,et al.  How to effectively use topic models for software engineering tasks? An approach based on Genetic Algorithms , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[41]  Phong Le,et al.  Quantifying the Vanishing Gradient and Long Distance Dependency Problem in Recursive Neural Networks and Recursive LSTMs , 2016, Rep4NLP@ACL.

[42]  Tao Wang,et al.  Convolutional Neural Networks over Tree Structures for Programming Language Processing , 2014, AAAI.

[43]  Zhenchang Xing,et al.  Predicting semantically linkable knowledge in developer online forums via convolutional neural network , 2016, 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE).

[44]  Collin McMillan,et al.  Automatically generating commit messages from diffs using neural machine translation , 2017, 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).

[45]  Marc Brockschmidt,et al.  Learning to Represent Programs with Graphs , 2017, ICLR.

[46]  Christopher D. Manning,et al.  Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks , 2015, ACL.

[47]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[48]  Gabriele Bavota,et al.  Deep Learning Similarities from Different Representations of Source Code , 2018, 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR).

[49]  Jane Cleland-Huang,et al.  Semantically Enhanced Software Traceability Using Deep Learning Techniques , 2017, 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE).

[50]  Andrian Marcus,et al.  Supporting program comprehension with source code summarization , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[51]  Name M. Lastname Automatically Finding Patches Using Genetic Programming , 2013 .

[52]  Yoshua Bengio,et al.  Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[53]  Eugene W. Myers,et al.  A precise inter-procedural data flow algorithm , 1981, POPL '81.

[54]  Tibor Gyimóthy,et al.  Modeling class cohesion as mixtures of latent topics , 2009, 2009 IEEE International Conference on Software Maintenance.

[55]  Andrea De Lucia,et al.  Using IR methods for labeling source code artifacts: Is it worthwhile? , 2012, 2012 20th IEEE International Conference on Program Comprehension (ICPC).

[56]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[57]  Michele Lanza,et al.  Evaluating defect prediction approaches: a benchmark and an extensive comparison , 2011, Empirical Software Engineering.

[58]  Zhendong Su,et al.  DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones , 2007, 29th International Conference on Software Engineering (ICSE'07).

[59]  Andrew Y. Ng,et al.  Parsing Natural Scenes and Natural Language with Recursive Neural Networks , 2011, ICML.

[60]  Ting Liu,et al.  Document Modeling with Gated Recurrent Neural Network for Sentiment Classification , 2015, EMNLP.

[61]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[62]  Collin McMillan,et al.  On using machine learning to automatically classify software applications into domain categories , 2014, Empirical Software Engineering.

[63]  Xiaodong Gu,et al.  Deep Code Search , 2018, 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).

[64]  Jeffrey G. Gray,et al.  An information retrieval process to aid in the analysis of code clones , 2009, Empirical Software Engineering.