Learning Powerful Policies by Using Consistent Dynamics Model

Model-based reinforcement learning (RL) approaches hold the promise of being sample efficient. Much of the progress in learning dynamics models in RL has been made by learning models via supervised learning. However, traditional model-based approaches suffer from "compounding errors" when the model is unrolled step by step: the state transitions that the learner predicts (by unrolling the model for multiple steps) and the state transitions that the learner experiences (by acting in the environment) may not be consistent. There is ample evidence that humans build a model of the environment not only by observing it but also by interacting with it. Interaction allows humans to carry out experiments: taking actions that help uncover true causal relationships, which can in turn be used to build better dynamics models. Analogously, we would expect such interactions to help a learning agent model the environment dynamics. In this paper, we build on this intuition by using an auxiliary cost function to ensure consistency between what the agent observes (by acting in the real world) and what it imagines (by acting in the "learned" world). We consider several tasks, MuJoCo-based control tasks and Atari games, and show that the proposed approach helps to train more powerful policies and better dynamics models.
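To make the mechanism concrete, below is a minimal sketch of such a multi-step consistency term, assuming a hypothetical `dynamics_model` callable that maps a (state, action) pair to a predicted next state and using a mean-squared-error distance; the actual architecture, distance metric, and loss weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def consistency_loss(dynamics_model, states, actions, k):
    """Sketch of a consistency term: penalize divergence between an
    imagined k-step rollout of the learned dynamics model and the
    states the agent actually observed along the same action sequence.

    states:  tensor of shape (k + 1, state_dim)  -- observed trajectory
    actions: tensor of shape (k, action_dim)     -- actions taken
    """
    predicted = states[0]  # start the imagined rollout from the real initial state
    loss = 0.0
    for t in range(k):
        # One-step prediction with the learned model (assumed interface),
        # fed its own previous prediction rather than the true state.
        predicted = dynamics_model(predicted, actions[t])
        # Consistency: the imagined state should match the observed one.
        loss = loss + F.mse_loss(predicted, states[t + 1])
    return loss / k
```

In the overall objective, this term would typically be added to the usual policy and model losses with a weighting coefficient (an assumed detail here); the key difference from a one-step supervised prediction loss is that the rollout recursively consumes its own predictions, so the errors that would otherwise compound over multiple steps are penalized directly.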
