ast2vec: Utilizing Recursive Neural Encodings of Python Programs

Educational datamining involves the application of datamining techniques to student activity. However, in the context of computer programming, many datamining techniques cannot be applied because they expect vector-shaped input, whereas computer programs have the form of syntax trees. In this paper, we present ast2vec, a neural network that maps Python syntax trees to vectors and back, thereby facilitating datamining on computer programs as well as the interpretation of datamining results. Ast2vec has been trained on almost half a million programs written by novice programmers and is designed to be applied across learning tasks without re-training, meaning that users can apply it without any need for (additional) deep learning. We demonstrate the generality of ast2vec in three settings. First, we provide example analyses using ast2vec on a classroom-sized dataset, involving visualization, student motion analysis, clustering, and outlier detection, including two novel analyses, namely a progress-variance projection and a dynamical systems analysis. Second, we consider the ability of ast2vec to recover the original syntax tree from its vector representation on the training data and two further large-scale programming datasets. Finally, we evaluate the predictive capability of a simple linear regression on top of ast2vec, obtaining results similar to those of techniques that work directly on syntax trees. We hope ast2vec can augment the educational datamining toolbelt by making analyses of computer programs easier, richer, and more efficient.
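To illustrate the gap between tree-shaped programs and vector-shaped input, the following minimal sketch parses a small Python program into its syntax tree with the standard `ast` module and turns it into a fixed-length vector by counting node types. This is not ast2vec and does not use its API: the helper names `node_type_counts` and `to_vector`, the example program, and the five-entry vocabulary are all assumptions made for illustration. Unlike the learned encoding described above, such a count vector cannot be decoded back into a syntax tree.

```python
import ast
from collections import Counter
from typing import List, Sequence


def node_type_counts(source: str) -> Counter:
    """Parse Python source into a syntax tree and count its node types."""
    tree = ast.parse(source)
    # ast.walk visits every node in the tree (in no guaranteed order).
    return Counter(type(node).__name__ for node in ast.walk(tree))


def to_vector(source: str, vocabulary: Sequence[str]) -> List[int]:
    """Map a program to a fixed-length vector of node-type counts.

    Illustrative stand-in only: a hand-made bag of node types,
    not the learned (and decodable) ast2vec encoding.
    """
    counts = node_type_counts(source)
    return [counts[name] for name in vocabulary]


# Hypothetical example program and vocabulary (assumptions for this sketch).
program = "x = 1\nprint(x + 2)\n"
vocabulary = ["Assign", "BinOp", "Call", "Constant", "Name"]
vector = to_vector(program, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Once programs are represented as vectors of equal length, standard vector-based techniques such as clustering or linear regression can in principle be applied to them, which is the role the abstract assigns to the ast2vec encoding.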
