Exploring model-based methods for reinforcement learning
In the area of unsupervised reinforcement learning, where the environment is modeled as a Markov Decision Process, the goal is to gather information about the system in order to produce an optimal strategy for navigating it. Each state of the environment has a reward associated with entering it, and an optimal policy maximizes the long-term reward. Producing an optimal policy often requires the intermediate step of policy evaluation. A variety of methods have been used for policy evaluation, the most popular of which is Temporal Differencing, an efficient model-free method that saves storage because no model of the environment is explicitly stored. Model-based methods of policy evaluation have generally been less popular because of perceived slower execution times and greater storage costs, especially as the state space grows. This thesis counters those limitations by demonstrating efficient model-based policy evaluation approaches whose storage cost is linear in the size of the state space and whose value estimates are more accurate than those of Temporal Difference methods. Two model-based approaches are used: a maximum likelihood method for sparsely and densely connected networks, and a matrix inversion approach for intermediate cases. As the state space grows, a least-squares approximation may be applied to these model-based methods.
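To make the contrast concrete, the sketch below compares TD(0) policy evaluation with a simple model-based evaluation that fits a maximum likelihood model of the transition probabilities and expected rewards from the same experience and then solves the Bellman equations by matrix inversion, V = (I - γP)⁻¹ r̄ with r̄ = P r when rewards are received on entering a state. This is a minimal illustration under assumed settings (the toy chain, discount factor, step size, and all names such as generate_transitions are hypothetical), not the thesis's specific algorithms.

```python
# Illustrative sketch (not from the thesis): TD(0) vs. naive model-based
# policy evaluation on a small Markov reward process with known ground truth.
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9

# Hypothetical Markov reward process (a fixed policy already folded in).
P_true = rng.dirichlet(np.ones(n_states), size=n_states)  # transition matrix
r_true = rng.uniform(0.0, 1.0, size=n_states)             # reward for entering each state
V_true = np.linalg.solve(np.eye(n_states) - gamma * P_true, P_true @ r_true)

def generate_transitions(n_steps):
    """Sample (s, reward, s') transitions by following the chain."""
    s = 0
    for _ in range(n_steps):
        s_next = rng.choice(n_states, p=P_true[s])
        yield s, r_true[s_next], s_next
        s = s_next

# --- Model-free: TD(0) with a constant step size. ---
V_td = np.zeros(n_states)
alpha = 0.05
for s, r, s_next in generate_transitions(20_000):
    V_td[s] += alpha * (r + gamma * V_td[s_next] - V_td[s])

# --- Model-based: maximum likelihood model, then solve the Bellman equations. ---
counts = np.zeros((n_states, n_states))   # transition counts
reward_sums = np.zeros(n_states)          # summed one-step rewards per source state
for s, r, s_next in generate_transitions(20_000):
    counts[s, s_next] += 1
    reward_sums[s] += r
visits = np.maximum(counts.sum(axis=1), 1)
P_hat = counts / visits[:, None]          # estimated transition probabilities
r_hat = reward_sums / visits              # estimated expected one-step reward
V_model = np.linalg.solve(np.eye(n_states) - gamma * P_hat, r_hat)

print("TD(0)       max error:", np.max(np.abs(V_td - V_true)))
print("Model-based max error:", np.max(np.abs(V_model - V_true)))
```

Note that this naive model stores the full n x n count matrix, which is exactly the quadratic storage cost the thesis's approaches are designed to avoid; the sketch is only meant to illustrate the general model-free versus model-based trade-off discussed in the abstract.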