Programmatically Interpretable Reinforcement Learning

We present a reinforcement learning framework, called Programmatically Interpretable Reinforcement Learning (PIRL), that is designed to generate interpretable and verifiable agent policies. Unlike the popular Deep Reinforcement Learning (DRL) paradigm, which represents policies by neural networks, PIRL represents policies using a high-level, domain-specific programming language. Such programmatic policies have the benefits of being more easily interpreted than neural networks, and being amenable to verification by symbolic methods. We propose a new method, called Neurally Directed Program Search (NDPS), for solving the challenging nonsmooth optimization problem of finding a programmatic policy with maximal reward. NDPS works by first learning a neural policy network using DRL, and then performing a local search over programmatic policies that seeks to minimize the distance to this neural "oracle". We evaluate NDPS on the task of learning to drive a simulated car in the TORCS car-racing environment. We demonstrate that NDPS discovers human-readable policies that clear significant performance bars. We also show that PIRL policies can have smoother trajectories, and can be more easily transferred to environments not encountered during training, than corresponding policies discovered by DRL.
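To make the search procedure concrete, the sketch below outlines the NDPS loop in Python. It is a minimal illustration of the idea rather than the paper's implementation: oracle, init_program, neighborhood, rollout_states, and distance are hypothetical stand-ins for the trained DRL policy, a seed program in the policy DSL, the DSL's syntactic neighborhood function, state sampling via policy rollouts, and the imitation-distance metric, respectively.

def ndps(oracle, init_program, neighborhood, rollout_states, distance, max_iters=20):
    """Sketch of Neurally Directed Program Search (NDPS).

    All arguments are hypothetical stand-ins, not the authors' API:
      oracle         -- neural policy trained with DRL, mapping state -> action
      init_program   -- initial program in the policy DSL
      neighborhood   -- yields programs syntactically close to a given program
      rollout_states -- returns states visited when executing a policy
      distance       -- disagreement between a program and the oracle on a set
                        of states (e.g., summed action-space distance)
    """
    program = init_program
    # Seed the input set with states encountered by the neural oracle.
    states = list(rollout_states(oracle))
    for _ in range(max_iters):
        # Input augmentation: also measure imitation error on states the
        # current program itself reaches, not just those the oracle reaches.
        states += rollout_states(program)
        # Local search: move to the neighboring program that best imitates
        # the oracle on the collected states, if it improves the distance.
        best = min(neighborhood(program), key=lambda p: distance(p, oracle, states))
        if distance(best, oracle, states) >= distance(program, oracle, states):
            break  # local optimum in program space
        program = best
    return program

The input-augmentation step is the key design choice in this sketch: distances are evaluated on states the candidate program actually visits, steering the search toward behaviorally relevant inputs in the spirit of DAgger-style imitation learning.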
