Imitation-Projected Policy Gradient for Programmatic Reinforcement Learning

We present Imitation-Projected Policy Gradient (IPPG), an algorithmic framework for learning policies that are parsimoniously represented in a structured programming language. Such programmatic policies can be more interpretable, generalizable, and amenable to formal verification than neural policies; however, designing rigorous learning approaches for programmatic policies remains a challenge. IPPG, our response to this challenge, is based on three insights. First, we view our learning task as optimization in policy space, modulo the constraint that the desired policy has a programmatic representation, and solve this optimization problem using a "lift-and-project" perspective that takes a gradient step into the unconstrained policy space and then projects back onto the constrained space. Second, we view the unconstrained policy space as mixing neural and programmatic representations, which enables employing state-of-the-art deep policy gradient approaches. Third, we cast the projection step as program synthesis via imitation learning, and exploit contemporary combinatorial methods for this task. We present theoretical convergence results for IPPG, as well as an empirical evaluation in three continuous control domains. The experiments show that IPPG can significantly outperform state-of-the-art approaches for learning programmatic policies.
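To make the lift-and-project scheme concrete, below is a minimal, self-contained Python sketch of the outer loop; it is illustrative only and not the authors' implementation. The 1-D toy environment, the tiny tanh network, the zeroth-order gradient estimate (standing in for a deep policy-gradient method such as TRPO or DDPG), and the one-parameter programmatic class u = -k*x are all assumptions made for brevity; the sketch also omits details such as warm-starting the neural policy from the current program and mixing the two representations.

```python
# Toy lift-and-project loop (illustrative only; not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

def rollout(policy, horizon=50):
    """Simulate x_{t+1} = x_t + 0.1*u_t + noise with reward -x^2; return visited states and return."""
    x, states, ret = 1.0, [], 0.0
    for _ in range(horizon):
        u = policy(x)
        states.append(x)
        ret += -(x ** 2)
        x = x + 0.1 * u + 0.01 * rng.standard_normal()
    return np.array(states), ret

def neural_policy(theta):
    """Tiny one-hidden-layer tanh network; theta packs (w1[4], b1[4], w2[4])."""
    w1, b1, w2 = theta[:4], theta[4:8], theta[8:]
    return lambda x: float(np.tanh(w1 * x + b1) @ w2)

def lift_step(theta, sigma=0.05, n_samples=30, lr=0.5):
    """Zeroth-order gradient ascent on the return -- a stand-in for a deep policy-gradient update."""
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        eps = rng.standard_normal(theta.shape)
        _, r_plus = rollout(neural_policy(theta + sigma * eps))
        _, r_minus = rollout(neural_policy(theta - sigma * eps))
        grad += (r_plus - r_minus) / (2.0 * sigma) * eps / n_samples
    return theta + lr * grad

def program(k):
    """The 'programmatic' policy class: proportional control u = -k * x."""
    return lambda x: -k * x

k, theta = 0.0, 0.1 * rng.standard_normal(12)
for it in range(10):
    # Lift: unconstrained gradient step on the neural policy.
    theta = lift_step(theta)
    # Project: program synthesis via imitation learning -- fit the best program
    # to the neural policy's actions on states visited by the current program.
    states, _ = rollout(program(k))
    targets = np.array([neural_policy(theta)(x) for x in states])
    k = float(-(states @ targets) / (states @ states + 1e-8))  # least squares for u = -k*x
    _, ret = rollout(program(k))
    print(f"iter {it}: program gain k = {k:.3f}, return = {ret:.2f}")
```

In this sketch the projection step is a DAgger-style imitation problem: the improved neural policy labels states that the current program itself visits, and the best-fitting program in the restricted class is recovered in closed form. In the full framework the programmatic class is far richer, and the projection is carried out by a combinatorial program synthesizer rather than least squares.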
