Model-Based Planning with Discrete and Continuous Actions

Action planning with learned, differentiable forward models of the world is a general approach with several desirable properties, including improved sample complexity over model-free RL methods, reuse of learned models across different tasks, and the ability to perform efficient gradient-based optimization in continuous action spaces. However, this approach does not apply straightforwardly when the action space is discrete. In this work, we show that it is in fact possible to perform planning via backpropagation effectively in discrete action spaces, using a simple parameterization of the action vectors on the simplex combined with input noise when training the forward model. Our experiments show that this approach can match or outperform model-free RL and discrete planning methods on gridworld navigation tasks in terms of performance and/or planning time while using limited environment interactions, and that it can additionally be used for model-based control in a challenging new task whose action space combines discrete and continuous actions. We furthermore propose a policy distillation approach that yields a fast policy network usable at inference time, removing the need for an iterative planning procedure.
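The core idea can be illustrated with a minimal sketch. Assuming a learned, differentiable forward model `forward_model(state, action)` and a simple distance-to-goal cost (the function names, cost, and hyperparameters below are illustrative assumptions, not the paper's exact setup), discrete actions are relaxed onto the simplex via a softmax over logits, the logits are optimized by backpropagating the cost through model rollouts, and the resulting actions are discretized for execution:

```python
# Minimal sketch of planning via backprop with simplex-parameterized discrete
# actions. The forward model, cost, and hyperparameters are illustrative
# assumptions, not the paper's exact implementation.
import torch
import torch.nn.functional as F

def plan_actions(forward_model, s0, goal, horizon=10, n_actions=4,
                 steps=100, lr=0.1):
    """Optimize a sequence of action logits by backpropagating a
    distance-to-goal cost through a learned, differentiable forward model."""
    logits = torch.zeros(horizon, n_actions, requires_grad=True)
    optimizer = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        state = s0
        # Roll out the model with actions constrained to the simplex.
        for t in range(horizon):
            action = F.softmax(logits[t], dim=-1)   # point on the simplex
            state = forward_model(state, action)
        loss = ((state - goal) ** 2).sum()          # distance-to-goal cost
        loss.backward()
        optimizer.step()
    # Discretize for execution: take the highest-weight action at each step.
    return logits.argmax(dim=-1)
```

In this sketch, the argmax actions would be executed in the environment, and the input noise mentioned above would be added to the (relaxed) action inputs while training the forward model, so that the model behaves reasonably on the interior of the simplex where the optimizer operates.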
