A general path-based representation for predicting program properties

Predicting program properties such as names or expression types has a wide range of applications: it can ease the task of programming and increase programmer productivity. A major challenge when learning from programs is how to represent programs in a way that facilitates effective learning. We present a general path-based representation for learning from programs. Our representation is purely syntactic and extracted automatically. The main idea is to represent a program using paths in its abstract syntax tree (AST). This allows a learning model to leverage the structured nature of code rather than treating it as a flat sequence of tokens. We show that this representation is general and can: (i) cover different prediction tasks, (ii) drive different learning algorithms (for both generative and discriminative models), and (iii) work across different programming languages. We evaluate our approach on the tasks of predicting variable names, method names, and full types. We use our representation to drive both CRF-based and word2vec-based learning, for programs in four languages: JavaScript, Java, Python, and C#. Our evaluation shows that our approach obtains better results than task-specific handcrafted representations across different tasks and programming languages.
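
To make the core idea concrete, the following is a minimal sketch (not the paper's implementation) of extracting leaf-to-leaf AST paths for Python source using the standard ast module. The function names leaf_paths and ast_paths, the choice of identifier and literal nodes as leaves, and the arrow notation for up/down steps are illustrative assumptions; the paper's full representation is richer.

```python
import ast
import itertools

def leaf_paths(node, ancestors=()):
    """Yield the root-to-leaf ancestor chain for each identifier/literal leaf."""
    ancestors = ancestors + (node,)
    if isinstance(node, (ast.Name, ast.Constant)):
        yield ancestors
    else:
        for child in ast.iter_child_nodes(node):
            yield from leaf_paths(child, ancestors)

def label(node):
    """The surface token a leaf carries: a variable name or a literal value."""
    return node.id if isinstance(node, ast.Name) else repr(node.value)

def ast_paths(source):
    """Enumerate (leaf, path, leaf) triplets: each path climbs from the first
    leaf to the lowest common ancestor, then descends to the second leaf."""
    for p1, p2 in itertools.combinations(leaf_paths(ast.parse(source)), 2):
        # Find where the two ancestor chains diverge; node identity (is)
        # distinguishes distinct nodes that happen to share a type name.
        i = 0
        while i < min(len(p1), len(p2)) and p1[i] is p2[i]:
            i += 1
        up = "↑".join(type(n).__name__ for n in reversed(p1[i - 1:]))
        down = "↓".join(type(n).__name__ for n in p2[i:])
        yield label(p1[-1]), up + "↓" + down, label(p2[-1])

if __name__ == "__main__":
    # For "x = y + 1", the triplets include
    # ('x', 'Name↑Assign↓BinOp↓Name', 'y'), connecting the two variables
    # through their lowest common ancestor, the Assign node.
    for triplet in ast_paths("x = y + 1"):
        print(triplet)
```

Such triplets can then serve as features for a discriminative model (e.g., a CRF over program elements) or as contexts for a word2vec-style embedding, which is how the representation drives the two learning algorithms evaluated in the paper.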
