Self-Consistent Models and Values

Learned models of the environment provide reinforcement learning (RL) agents with flexible ways of making predictions about that environment. In particular, models enable planning, i.e., using more computation to improve value functions or policies without requiring additional environment interactions. In this work, we investigate a way of augmenting model-based RL by additionally encouraging a learned model and value function to be jointly self-consistent. Our approach differs from classic planning methods such as Dyna, which only update values to be consistent with the model. We propose multiple self-consistency updates, evaluate them in both tabular and function approximation settings, and find that, with appropriate choices, self-consistency helps both policy evaluation and control.
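
To make the contrast concrete, here is a minimal tabular sketch of the general idea, not the paper's actual algorithm: a Dyna-style step updates only the value table toward the learned model's one-step Bellman target, whereas a joint self-consistency step also nudges the model so that model and values agree. The function names, the toy MDP, and the specific update rules (e.g., adjusting only the reward estimate) are illustrative assumptions.

```python
import numpy as np

def bellman_target(P_hat, r_hat, V, gamma=0.99):
    """One-step Bellman target computed under the learned model (P_hat, r_hat)."""
    return r_hat + gamma * P_hat @ V

def dyna_style_update(P_hat, r_hat, V, alpha=0.1, gamma=0.99):
    """Dyna-style planning step: only the values move toward the model's target."""
    target = bellman_target(P_hat, r_hat, V, gamma)
    return V + alpha * (target - V)

def self_consistency_update(P_hat, r_hat, V, alpha=0.1, beta=0.1, gamma=0.99):
    """Joint step (illustrative): values move toward the model's target, and the
    model's reward estimate moves so that its target agrees with the values."""
    target = bellman_target(P_hat, r_hat, V, gamma)
    error = target - V
    V_new = V + alpha * error        # value moves toward the model's prediction
    r_hat_new = r_hat - beta * error  # model moves toward the current values
    return V_new, r_hat_new

# Toy 3-state evaluation problem with a fixed policy baked into P_hat and r_hat.
rng = np.random.default_rng(0)
P_hat = rng.dirichlet(np.ones(3), size=3)  # learned transition estimate (rows sum to 1)
r_hat = rng.normal(size=3)                 # learned reward estimate
V = np.zeros(3)

for _ in range(100):
    V, r_hat = self_consistency_update(P_hat, r_hat, V)
```

With function approximation, the analogous construction would be a shared consistency loss whose gradients flow into both the model and the value parameters; which quantities are held fixed in each update is the kind of design choice that distinguishes the different self-consistency updates described above.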
