Model-Based Planning in Discrete Action Spaces

Planning actions using learned, differentiable forward models of the world is a general approach that has a number of desirable properties, including improved sample complexity over model-free RL methods, reuse of learned models across different tasks, and the ability to perform efficient gradient-based optimization in continuous action spaces. However, the approach does not apply straightforwardly when the action space is discrete, which may have limited its adoption. In this work, we introduce two discrete planning tasks inspired by existing question-answering datasets and show that it is in fact possible to perform planning via backprop effectively in discrete action spaces using two simple yet principled modifications. Our experiments show that this approach can significantly outperform both model-free RL methods and supervised imitation learners.
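
As a rough illustration of planning via backprop with discrete actions, the sketch below shows one plausible instantiation in PyTorch, not necessarily the paper's exact procedure: discrete actions are relaxed onto the probability simplex via a softmax so gradients can flow through the learned forward model, and the relaxed plan is discretized at execution time. The names `forward_model`, `plan`, and the goal-distance loss are illustrative assumptions.

```python
# Minimal sketch: gradient-based planning through a learned, differentiable
# forward model, with a softmax relaxation standing in for discrete actions.
import torch
import torch.nn.functional as F

def plan(forward_model, s0, goal, horizon, n_actions, steps=100, lr=0.1):
    # Optimize unconstrained logits; softmax maps them onto the action
    # simplex, giving a differentiable surrogate for one-hot actions.
    logits = torch.zeros(horizon, n_actions, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        s = s0
        for t in range(horizon):
            a = F.softmax(logits[t], dim=-1)  # relaxed action on the simplex
            s = forward_model(s, a)           # roll the world model forward
        loss = F.mse_loss(s, goal)            # distance of final state to goal
        loss.backward()                       # gradients flow through the model
        opt.step()
    # Discretize at execution time: take the highest-probability action
    # at each step of the optimized plan.
    return logits.argmax(dim=-1)
```

Relaxing the actions is only half the picture; keeping the optimization in regions where the learned model is accurate (for example, by injecting noise during planning) is the kind of complementary modification the abstract alludes to.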
