Synthesizing Programmatic Policies that Inductively Generalize

Deep reinforcement learning has successfully solved a number of challenging control tasks. However, learned policies typically have difficulty generalizing to novel environments. We propose an algorithm for learning programmatic state machine policies that capture repeating behaviors; as a result, these policies can generalize to task instances requiring an arbitrary number of repetitions, a property we call inductive generalization. However, state machine policies are hard to learn because they combine continuous and discrete structure. We propose a learning framework called adaptive teaching, which learns a state machine policy by imitating a teacher; in contrast to traditional imitation learning, our teacher adaptively updates itself based on the structure of the student. We show that our algorithm learns policies that inductively generalize to novel environments, whereas traditional neural network policies fail to do so.
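
To make the "combination of continuous and discrete structure" concrete, below is a minimal sketch (not taken from the paper) of what a state machine policy might look like: each mode holds a continuous low-level controller, and threshold predicates over the observation decide when to switch modes. The class names, the linear controllers, and the switching predicates are illustrative assumptions; the adaptive-teaching training loop itself, which would alternate between re-optimizing a teacher and fitting these parameters to its trajectories, is omitted.

```python
import numpy as np

class Mode:
    """One mode of a state machine policy: a linear controller plus a
    switching predicate that decides when to hand off to another mode."""
    def __init__(self, K, b, switch_w, switch_c, next_mode):
        self.K = K                  # action = K @ obs + b while in this mode
        self.b = b
        self.switch_w = switch_w    # switch when switch_w @ obs + switch_c > 0
        self.switch_c = switch_c
        self.next_mode = next_mode  # index of the mode to switch into

class StateMachinePolicy:
    """Hypothetical state machine policy: a cycle of modes lets the same
    learned parameters handle any number of repetitions of a behavior."""
    def __init__(self, modes, start=0):
        self.modes = modes
        self.mode = start

    def act(self, obs):
        m = self.modes[self.mode]
        if m.switch_w @ obs + m.switch_c > 0:   # discrete structure: mode transition
            self.mode = m.next_mode
            m = self.modes[self.mode]
        return m.K @ obs + m.b                  # continuous structure: low-level control

# Toy usage: two modes that alternate (e.g. "reach forward" / "pull back").
obs_dim, act_dim = 4, 2
rng = np.random.default_rng(0)
modes = [
    Mode(rng.normal(size=(act_dim, obs_dim)), np.zeros(act_dim),
         rng.normal(size=obs_dim), 0.0, next_mode=1),
    Mode(rng.normal(size=(act_dim, obs_dim)), np.zeros(act_dim),
         rng.normal(size=obs_dim), 0.0, next_mode=0),
]
policy = StateMachinePolicy(modes)
print(policy.act(rng.normal(size=obs_dim)))
```

Because the number of repetitions is encoded by the mode-transition loop rather than by the policy's parameters, the same policy can, in principle, be rolled out on instances requiring arbitrarily many repetitions, which is the inductive generalization property described above.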
