In model-based reinforcement learning (MBRL), Wan et al. (2019) showed conditions under which the environment model could produce the expectation of the next feature vector rather than the full distribution, or a sample thereof, with no loss in planning performance. Such expectation models are of interest when the environment is stochastic and non-stationary, and the model is approximate, such as when it is learned using function approximation. In these cases a full distribution model may be impractical and a sample model may be either more expensive computationally or of high variance. Wan et al. considered only planning for prediction to evaluate a fixed policy. In this paper, we treat the control case—planning to improve and find a good approximate policy. We prove that planning with an expectation model must update a state-value function, not an action-value function as previously suggested (e.g., Sorg & Singh, 2010). This opens the question of how planning influences action selection. We consider three strategies for this and present general MBRL algorithms for each. We identify the strengths and weaknesses of these algorithms in computational experiments. Our algorithms and experiments are the first to treat MBRL with expectation models in a general setting.

Department of Engineering, University of Guelph, Guelph, Canada; Department of Computing Science, University of Alberta, Edmonton, Canada; Alberta Machine Intelligence Institute, Edmonton, Canada. Correspondence to: Katya Kudashkina <ekudashk@uoguelph.ca>. Pre-print. Copyright 2021 by the author(s).

Methods that scale with computation are most likely to stand the test of time. We refer to scaling with computation in the context of artificial intelligence (AI) as using more computation to provide a better approximate answer. This is in contrast to the common notion of scaling in computer science as using more computation to solve a bigger problem exactly. The recent success of modern machine learning techniques, in particular that of deep learning, is primarily due to their ability to leverage ever-increasing computational power (driven by Moore's Law), as well as their generality in terms of dependence on data rather than hand-crafted features or rule-based techniques. Key to building general-purpose AI systems would be methods that scale with computation (Sutton, 2019).

RL is on its way to fully embracing scaling with computation—it already does so in many aspects, for example by leveraging search techniques (e.g., Monte Carlo Tree Search; see Browne et al., 2012; Finnsson & Björnsson, 2008) and modern deep learning techniques such as artificial neural networks. Methods that resort to approximating functions rather than learning them exactly have not been fully investigated. Extending the techniques used in the simpler tabular regime to this function approximation regime is an obvious first step, but some of the ideas that have served us well in the past might actually be impeding progress on the new problem of interest. For example, Sutton & Barto (2018) showed that when dealing with feature vectors rather than underlying states, the common Bellman error objective is not learnable with any amount of experiential data. Recently, Naik et al. (2019) showed that discounting is incompatible with function approximation in the case of continuing control tasks.
Understanding function approximation in RL is key to building general-purpose intelligent systems that can learn to solve many tasks of arbitrary complexity in the real world. Many of the methods in RL focus on approximating value functions for a given fixed policy—referred to as the prediction problem (e.g., Sutton et al., 1988; Singh et al., 1995; Wan et al., 2019). The more challenging problem of learning the best behavior is known as the control problem—that is, approximating optimal policies and optimal value functions. In the control problem, an RL agent learns within one of two broad frameworks: model-free RL and model-based RL (MBRL). In model-free RL, the agent relies solely on its observations to make decisions (Sutton & Barto, 2018). In model-based RL, the agent has a model of the world, which it uses in conjunction with its observations to plan its decisions. The process of taking a model as input and producing or improving a policy for interacting with the modeled environment is referred to as planning.

Models and planning are helpful. One advantage is that they are useful when the agent faces unfamiliar or novel situations—when the agent may have to consider states and actions that it has not experienced or seen before. Planning can help the agent evaluate possible actions by rolling out hypothetical scenarios according to the model and then computing their expected future outcomes (Doll et al., 2012; Ha & Schmidhuber, 2018; Sutton & Barto, 2018). Planning with function approximation remains a challenge in reinforcement learning today (Shariff & Szepesvári, 2020).

Planning can be performed with various kinds of models: distribution, sample, and expectation. Wan et al. (2019) considered planning with an expectation model for the prediction problem within the function approximation setting. In this paper we extend Wan et al.'s (2019) work on the prediction problem to the more challenging control problem, in the context of stochastic and non-stationary environments. This will involve several important definitions. We start by discussing important choices in MBRL (Section 1). This is followed by fundamentals of planning with expectation models in the general context of function approximation (Section 2). We then show (in Section 3) that planning with an expectation model for control in stochastic non-stationary environments must update a state-value function and not an action-value function as previously suggested (e.g., Sorg & Singh, 2010). Finally, we consider three ways in which actions can be selected when planning with state-value functions, and identify their relative strengths and weaknesses (Sections 4 & 5).

1. Choices in Model-Based RL

Model-based methods are an important part of reinforcement learning's claim to provide a full account of intelligence. An intelligent agent should be able to model its environment and use that model flexibly and efficiently to plan its behavior. In MBRL, models add knowledge to the agent in a way that policies and value functions do not (van Hasselt et al., 2019). Typically, a model receives a state and an action as inputs, and computes the next state and reward (Kuvayev & Sutton, 1996; Sutton et al., 2008; Hester & Stone, 2011). This output is used in planning to further improve policies and value functions. In this section we discuss three important choices one needs to make in MBRL.
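As a concrete illustration of how such model outputs can drive planning, consider a minimal sketch, assuming linear function approximation in the style of Wan et al. (2019); the notation here is ours, not necessarily the paper's. Suppose an expectation model maps a (possibly hypothetical) feature vector $x$ and action $a$ to an expected next feature vector $\hat{x}'(x,a)$ and an expected reward $\hat{r}(x,a)$, and the state-value estimate is linear, $\hat{v}(x) = \mathbf{w}^\top x$. A Dyna-style planning step could then update the weights by

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \big( \hat{r}(x,a) + \gamma\, \mathbf{w}^\top \hat{x}'(x,a) - \mathbf{w}^\top x \big)\, x,$$

where $\alpha$ is a step size and $\gamma$ is the discount factor. The point developed later in the paper is that, with an expectation model, planning updates of this kind must target a state-value function rather than an action-value function.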
Learned Models vs. Experience Replay. One choice to make is where planning improvements come from: learned models or experience replay (ER) (Lin, 1992). In replay-based methods the agent plans using experience stored in the agent's memory. Some replay-based examples include deep Q-networks (DQN) (Mnih et al., 2013; 2015) and its variations: double DQN (van Hasselt et al., 2016), DQN with prioritized experience replay (Schaul et al., 2016), deep deterministic policy gradient (Lillicrap et al., 2016), and rainbow DQN (Hessel et al., 2018). MBRL methods in which 1) the model is parameterized by some learnable weights, 2) the agent learns the parameters of the model, and 3) the agent then uses the model to plan an improved policy, are referred to as planning with learned parametric models. Learned parametric models are used in the Dyna architecture (Sutton, 1991), in the normalized advantage functions method, which incorporates a learned model into the Q-learning algorithm based on imagination rollouts (Gu et al., 2016), and in MuZero (Schrittwieser et al., 2019). The latter is a combination of a replay-based method and a learned model: the model is trained using trajectories sampled from the replay buffer. If the environment is non-stationary, the transitions stored in the replay buffer might be stale and can slow down or even hinder learning progress. In this work, we focus on learned parametric models.

Types of Learned Models. Another important choice in MBRL is the type of model. A model enables an agent to predict what would happen if actions were executed from states prior to actually executing them, and without necessarily being in those states. Given a state and action, the model can predict a sample, an expectation, or a distribution of outcomes, which results in three model-type possibilities (a minimal code sketch contrasting these interfaces is given below). The first possibility is a model that computes the probability of each possible next state resulting from the action taken by the agent. We refer to such a model as a distribution model. Such models have typically been used with an assumption of a particular kind of distribution, such as a Gaussian (e.g., Chua et al., 2018). For example, Deisenroth & Rasmussen (2011) proposed a model-based policy search method based on probabilistic inference for learning control, in which a distribution model is learned using Gaussian processes. Learning a distribution can be challenging: 1) distributions are potentially large objects (Wan et al., 2019); and 2) distribution models require an efficient way of representing and computing the distribution, and both tasks can be difficult (Kurutach et al., 2018). The second possibility is for the model to compute a sample of the next state rather than the full distribution. We refer to such a model as a sample model. The output of sample models is more compact than the output of distribution models and thus more computationally feasible. In this sense sample models are similar to experience replay. Sample models have been a good solution to when
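The following is a minimal sketch, not taken from the paper, contrasting the three model types. The environment, class names, and numbers are hypothetical; states are represented by feature vectors, and the sample and expectation models are derived here from an explicit distribution model purely for illustration.

```python
import numpy as np


class DistributionModel:
    """Maps (x, a) to an explicit distribution over (next feature vector, reward) pairs."""

    def __init__(self, outcomes):
        # outcomes[a] is a list of (probability, next_x, reward) triples
        self.outcomes = outcomes

    def predict(self, x, a):
        return self.outcomes[a]


class SampleModel:
    """Maps (x, a) to a single sampled (next feature vector, reward) pair."""

    def __init__(self, distribution_model, seed=0):
        self.dist = distribution_model
        self.rng = np.random.default_rng(seed)

    def predict(self, x, a):
        outcomes = self.dist.predict(x, a)
        probs = [p for p, _, _ in outcomes]
        i = self.rng.choice(len(outcomes), p=probs)
        _, next_x, reward = outcomes[i]
        return next_x, reward


class ExpectationModel:
    """Maps (x, a) to the expected next feature vector and expected reward only."""

    def __init__(self, distribution_model):
        self.dist = distribution_model

    def predict(self, x, a):
        outcomes = self.dist.predict(x, a)
        exp_x = sum(p * np.asarray(next_x, dtype=float) for p, next_x, _ in outcomes)
        exp_r = sum(p * r for p, _, r in outcomes)
        return exp_x, exp_r


# Example: a single action whose outcome is equally likely to move the feature
# vector "up" or "down"; the expectation model returns the average outcome.
outcomes = {0: [(0.5, [1.0, 0.0], 1.0), (0.5, [0.0, 1.0], 0.0)]}
dist = DistributionModel(outcomes)
print(SampleModel(dist).predict(np.array([0.5, 0.5]), 0))
print(ExpectationModel(dist).predict(np.array([0.5, 0.5]), 0))
```

In planning, the expectation model's output would stand in for a real transition, as in the TD-style update sketched earlier.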
References

[1] Sean R. Eddy, et al. What is dynamic programming? Nature Biotechnology, 2004.
[2] Christopher G. Atkeson, et al. A comparison of direct and model-based reinforcement learning. Proceedings of International Conference on Robotics and Automation, 1997.
[3] N. Daw, et al. The ubiquity of model-based reinforcement learning. Current Opinion in Neurobiology, 2012.
[4] Taher Jafferjee. Chasing Hallucinated Value: A Pitfall of Dyna Style Algorithms with Imperfect Environment Models. 2020.
[5] Carl E. Rasmussen, et al. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. ICML, 2011.
[6] Leslie Pack Kaelbling, et al. Acting Optimally in Partially Observable Stochastic Domains. AAAI, 1994.
[7] Max Welling, et al. Auto-Encoding Variational Bayes. ICLR, 2013.
[8] Mark B. Ring. Continual learning in reinforcement environments. GMD-Bericht, 1995.
[9] Katja Hofmann, et al. A Deep Learning Approach for Joint Video Frame and Reward Prediction in Atari Games. ICLR, 2016.
[10] J. Albus. A Theory of Cerebellar Function. 1971.
[11] Sergey Levine, et al. Model-Based Reinforcement Learning for Atari. ICLR, 2019.
[12] Satinder P. Singh, et al. Reinforcement Learning with a Hierarchy of Abstract Models. AAAI, 1992.
[13] Richard S. Sutton, et al. Discounted Reinforcement Learning is Not an Optimization Problem. ArXiv, 2019.
[14] Demis Hassabis, et al. Mastering the game of Go without human knowledge. Nature, 2017.
[15] Simon M. Lucas, et al. A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games, 2012.
[16] Richard S. Sutton, et al. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983.
[17] Yngvi Björnsson, et al. Simulation-Based Approach to General Game Playing. AAAI, 2008.
[18] Sergey Levine, et al. Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning. ArXiv, 2018.
[19] Jürgen Schmidhuber, et al. Deep learning in neural networks: An overview. Neural Networks, 2014.
[20] Razvan Pascanu, et al. Imagination-Augmented Agents for Deep Reinforcement Learning. NIPS, 2017.
[21] Richard S. Sutton, et al. Reinforcement Learning: An Introduction. IEEE Trans. Neural Networks, 1998.
[22] Erik Talvitie, et al. Self-Correcting Models for Model-Based Reinforcement Learning. AAAI, 2016.
[23] N. Whitman. A bitter lesson. Academic Medicine: Journal of the Association of American Medical Colleges, 1999.
[24] Richard S. Sutton, et al. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. ML, 1990.
[25] Dimitri P. Bertsekas, et al. Dynamic Programming and Optimal Control, Two Volume Set. 1995.
[26] Tom Schaul, et al. Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI, 2017.
[27] Richard S. Sutton, et al. Model-Based Reinforcement Learning with an Approximate, Learned Model. 1996.
[28] Erik Talvitie, et al. The Effect of Planning Shape on Dyna-style Planning in High-dimensional State Spaces. ArXiv, 2018.
[29] Long Ji Lin, et al. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 1992.
[30] Adam M. White, et al. Developing a Predictive Approach to Knowledge. 2015.
[31] Sergey Levine, et al. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. NeurIPS, 2018.
[32] Ben J. A. Kröse, et al. Learning from delayed rewards. Robotics Auton. Syst., 1995.
[33] Matteo Hessel, et al. When to use parametric models in reinforcement learning? NeurIPS, 2019.
[34] Yuval Tassa, et al. Continuous control with deep reinforcement learning. ICLR, 2015.
[35] Honglak Lee, et al. Action-Conditional Video Prediction using Deep Networks in Atari Games. NIPS, 2015.
[36] Richard S. Sutton, et al. Learning to predict by the methods of temporal differences. Machine Learning, 1988.
[37] Richard S. Sutton, et al. Temporal Abstraction in Temporal-difference Networks. NIPS, 2005.
[38] Leslie Pack Kaelbling, et al. Hierarchical Learning in Stochastic Domains: Preliminary Results. ICML, 1993.
[39] Mance E. Harmon, et al. Spurious Solutions to the Bellman Equation. 1999.
[40] Jing Peng, et al. Efficient Learning and Planning Within the Dyna Framework. Adapt. Behav., 1993.
[41] Shane Legg, et al. Human-level control through deep reinforcement learning. Nature, 2015.
[42] Pieter Abbeel, et al. An Application of Reinforcement Learning to Aerobatic Helicopter Flight. NIPS, 2006.
[43] Richard S. Sutton, et al. Predictive Representations of State. NIPS, 2001.
[44] Daan Wierstra, et al. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML, 2014.
[45] Harm van Seijen, et al. The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning. NeurIPS, 2020.
[46] Leslie Pack Kaelbling, et al. Learning to Achieve Goals. IJCAI, 1993.
[47] Pieter Abbeel, et al. Model-Ensemble Trust-Region Policy Optimization. ICLR, 2018.
[48] Alex Graves, et al. Playing Atari with Deep Reinforcement Learning. ArXiv, 2013.
[49] Sergey Levine, et al. Continuous Deep Q-Learning with Model-based Acceleration. ICML, 2016.
[50] David Silver, et al. Deep Reinforcement Learning with Double Q-Learning. AAAI, 2015.
[51] Guy Lever, et al. Deterministic Policy Gradient Algorithms. ICML, 2014.
[52] Martha White, et al. Hill Climbing on Value Estimates for Search-control in Dyna. IJCAI, 2019.
[53] Demis Hassabis, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 2019.
[54] Alborz Geramifard, et al. Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping. UAI, 2008.
[55] Honglak Lee, et al. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion. NeurIPS, 2018.
[56] John N. Tsitsiklis, et al. Neuro-Dynamic Programming. Encyclopedia of Machine Learning, 1996.
[57] Roshan Shariff, et al. Efficient Planning in Large MDPs with Weak Linear Function Approximation. NeurIPS, 2020.
[58] Shalabh Bhatnagar, et al. Multi-step linear Dyna-style planning. NIPS, 2009.
[59] Jürgen Schmidhuber, et al. Model-based reinforcement learning for evolving soccer strategies. 2001.
[60] Peter Stone, et al. Learning and Using Models. Reinforcement Learning, 2012.
[61] Leslie Pack Kaelbling, et al. Input Generalization in Delayed Reinforcement Learning: An Algorithm and Performance Comparisons. IJCAI, 1991.
[62] Erik Talvitie, et al. Model Regularization for Stable Sample Rollouts. UAI, 2014.
[63] Martha White, et al. Organizing Experience: a Deeper Look at Replay Mechanisms for Sample-Based Planning in Continuous State Domains. IJCAI, 2018.
[64] R. M. Dunn, et al. Brains, behavior, and robotics. Proceedings of the IEEE, 1983.
[65] Hado van Hasselt, et al. Double Q-learning. NIPS, 2010.
[66] Doina Precup, et al. Intra-Option Learning about Temporally Abstract Actions. ICML, 1998.
[67] Michael I. Jordan, et al. Reinforcement Learning with Soft State Aggregation. NIPS, 1994.
[68] Richard S. Sutton, et al. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bulletin, 1990.
[69] Tom Schaul, et al. Prioritized Experience Replay. ICLR, 2015.
[70] Masashi Sugiyama, et al. Statistical Reinforcement Learning: Modern Machine Learning Approaches. Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2015.
[71] W. A. Clark, et al. Simulation of self-organizing systems by digital computer. Trans. IRE Prof. Group Inf. Theory, 1954.
[72] Martha White, et al. Planning with Expectation Models. IJCAI, 2019.
[73] C. Atkeson, et al. Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time. 1993.
[74] Tom Schaul, et al. Universal Value Function Approximators. ICML, 2015.