Learning to Synthesize Programs as Interpretable and Generalizable Policies

Recently, deep reinforcement learning (DRL) methods have achieved impressive performance on tasks in a variety of domains. However, neural network policies produced with DRL methods are not human-interpretable and often have difficulty generalizing to novel scenarios. To address these issues, prior works explore learning programmatic policies that are more interpretable and structured for generalization. Yet, these works either employ limited policy representations (e.g., decision trees, state machines, or predefined program templates) or require stronger supervision (e.g., input/output state pairs or expert demonstrations). We present a framework that instead learns to synthesize a program, which details the procedure to solve a task in a flexible and expressive manner, solely from reward signals. To alleviate the difficulty of learning to compose programs that induce the desired agent behavior from scratch, we propose to first learn a program embedding space that continuously parameterizes diverse behaviors in an unsupervised manner, and then search over the learned embedding space to yield a program that maximizes the return for a given task. Experimental results demonstrate that the proposed framework not only learns to reliably synthesize task-solving programs but also outperforms DRL and program synthesis baselines while producing interpretable and more generalizable policies. We also justify the necessity of the proposed two-stage learning scheme and analyze various methods for learning the program embedding.
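The second stage described above, searching a learned continuous embedding space for a return-maximizing program, can be sketched with a simple cross-entropy method (CEM) loop. This is a minimal illustration, not the paper's implementation: `evaluate` here is a hypothetical stand-in for "decode the latent vector into a program, execute it in the environment, and return the episodic return," replaced by a toy reward so the sketch is self-contained.

```python
import numpy as np

# Hypothetical stand-in: in the actual framework, a learned decoder would map
# a latent vector to a program, which is executed in the environment to obtain
# a return. Here we substitute a toy reward (negative squared distance to an
# unknown optimum) purely to illustrate the search stage.
OPTIMUM = np.array([0.5, -1.0, 2.0])

def evaluate(z):
    """Toy surrogate for: decode z -> program, run program, collect return."""
    return -np.sum((z - OPTIMUM) ** 2)

def cem_search(dim=3, pop=64, elite_frac=0.25, iters=50, seed=0):
    """Cross-entropy method over the latent program embedding space."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        # Sample candidate embeddings from the current search distribution.
        samples = rng.normal(mu, sigma, size=(pop, dim))
        returns = np.array([evaluate(z) for z in samples])
        # Refit the distribution to the highest-return (elite) candidates.
        elites = samples[np.argsort(returns)[-n_elite:]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

best = cem_search()
```

Because the embedding space is continuous and was trained to parameterize diverse behaviors, gradient-free population search such as CEM suffices; the final `mu` is decoded into a human-readable program.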
