Convolutional Neural Networks over Tree Structures for Programming Language Processing

Programming language processing (similar to natural language processing) is an active research topic in software engineering, and it has attracted growing interest in the artificial intelligence community. Unlike a natural language sentence, however, a program contains rich, explicit, and complicated structural information, so traditional NLP models may be ill-suited to programs. In this paper, we propose a novel tree-based convolutional neural network (TBCNN) for programming language processing, in which a convolution kernel is designed over programs' abstract syntax trees to capture structural information. TBCNN is a generic architecture for programming language processing; our experiments show its effectiveness in two different program analysis tasks: classifying programs according to functionality, and detecting code snippets that match certain patterns. TBCNN outperforms baseline methods, including several neural models for NLP.
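
To make the idea of a convolution kernel over an abstract syntax tree more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the kernel, so everything here is an assumption made for illustration: the window is a node plus its direct children, there is one weight matrix for the parent and one shared by all children (child contributions are averaged), node vectors are random and untrained, and Python's own `ast` module stands in for whatever parser the experiments used. The names `embed`, `convolve`, `encode`, `DIM`, and `FILTERS` are hypothetical.

```python
import ast
import math
import random

DIM = 4       # size of each node embedding (illustrative choice)
FILTERS = 3   # number of feature detectors in the kernel (illustrative choice)

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def rand_matrix(rows, cols):
    """Small random weight matrix; a real model would learn these values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Assumption: one weight matrix for the window's parent, one shared by children.
W_PARENT = rand_matrix(FILTERS, DIM)
W_CHILD = rand_matrix(FILTERS, DIM)
BIAS = [0.0] * FILTERS

_embeddings = {}


def embed(node):
    """Look up (or lazily create) a vector keyed by the node's AST type name."""
    name = type(node).__name__
    if name not in _embeddings:
        _embeddings[name] = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    return _embeddings[name]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def convolve(node):
    """Apply the kernel to one window: a node and its direct children."""
    children = list(ast.iter_child_nodes(node))
    out = matvec(W_PARENT, embed(node))
    for child in children:
        contrib = matvec(W_CHILD, embed(child))
        # Average over children so windows of different widths are comparable.
        out = [o + c / len(children) for o, c in zip(out, contrib)]
    return [math.tanh(o + b) for o, b in zip(out, BIAS)]


def encode(source):
    """Slide the kernel over every AST node, then max-pool to a fixed size."""
    tree = ast.parse(source)
    features = [convolve(node) for node in ast.walk(tree)]
    return [max(column) for column in zip(*features)]


if __name__ == "__main__":
    print(encode("def add(a, b):\n    return a + b\n"))
```

Because the kernel is applied at every node and the results are max-pooled, programs of any size map to a vector of length `FILTERS`, which a downstream classifier could consume. In this sketch the embeddings depend only on node types, so two snippets with the same tree shape but different identifiers or constants yield the same vector.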
