Learning State Features from Policies to Bias Exploration in Reinforcement Learning

Abstract: When given several problems to solve in some domain, a standard reinforcement learner learns an optimal policy from scratch for each problem. If the domain has particular characteristics that are goal- and problem-independent, the learner might be able to take advantage of previously solved problems. Unfortunately, it is generally infeasible to directly apply a learned policy to new problems. This paper presents a method to bias exploration using previous problem solutions, which is shown to speed up learning on new problems. We first allow a Q-learner to learn the optimal policies for several problems. We describe each state in terms of local features, assuming that these state features, together with the learned policies, can be used to abstract the domain characteristics away from the specific layout of states and rewards in a particular problem. We then train a classifier to learn this abstraction, using training examples extracted from each learned Q-table. The trained classifier maps state features to actions that are successful in the domain and potentially goal-independent. Given a new problem, we include the output of the classifier as an exploration bias to improve the rate of convergence of the reinforcement learner. We have validated our approach empirically and report results in the complex Sokoban domain, which we introduce.
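The pipeline described in the abstract can be summarized in a short sketch. The following is a minimal illustration, not the authors' implementation: it assumes a tabular environment interface exposing reset(), step(action) returning (state, reward, done), an n_actions attribute, and a state_features(state) method returning the local feature vector of a state; the scikit-learn decision tree and the bias_prob parameter are illustrative choices.

import random
from collections import defaultdict

from sklearn.tree import DecisionTreeClassifier  # illustrative classifier choice


def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1, explore=None):
    """Tabular Q-learning; `explore` supplies the exploratory action, if given."""
    Q = defaultdict(lambda: [0.0] * env.n_actions)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = explore(state) if explore else random.randrange(env.n_actions)
            else:
                action = max(range(env.n_actions), key=lambda a: Q[state][a])
            next_state, reward, done = env.step(action)
            Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
            state = next_state
    return Q


def learn_action_classifier(source_problems):
    """Solve each source problem, then fit a classifier mapping local state
    features to the greedy action extracted from the learned Q-table."""
    X, y = [], []
    for env in source_problems:
        Q = q_learning(env)
        for state, values in Q.items():
            X.append(env.state_features(state))  # local, layout-independent features
            y.append(max(range(env.n_actions), key=lambda a: values[a]))
    return DecisionTreeClassifier().fit(X, y)


def solve_with_bias(new_env, classifier, bias_prob=0.8):
    """Q-learn a new problem; exploratory steps follow the classifier's
    suggested action with probability `bias_prob`, otherwise act at random."""
    def biased_explore(state):
        if random.random() < bias_prob:
            return int(classifier.predict([new_env.state_features(state)])[0])
        return random.randrange(new_env.n_actions)
    return q_learning(new_env, explore=biased_explore)

In the paper's setting, the source problems and the new problem would be different Sokoban instances that share the same local state features, so the classifier's suggested actions serve only as an exploration bias while Q-learning still converges on the new problem.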
