Algorithm-Directed Exploration for Model-Based Reinforcement Learning in Factored MDPs

One of the central challenges in reinforcement learning is balancing the exploration/exploitation tradeoff while scaling up to large problems. Although model-based reinforcement learning has been less prominent than value-based methods in addressing these challenges, recent progress has generated renewed interest in model-based approaches: theoretical work on the exploration/exploitation tradeoff has yielded provably sound model-based algorithms such as E3 and Rmax, while work on factored MDP representations has yielded model-based algorithms that can scale up to large problems. Recently, the benefits of both achievements have been combined in the Factored E3 algorithm of Kearns and Koller. In this paper, we address a significant shortcoming of Factored E3: it requires an oracle planner that cannot be feasibly implemented. We propose an alternative approach that uses a practical approximate planner, approximate linear programming, while maintaining desirable properties. Further, we develop an exploration strategy that is targeted toward improving the performance of the linear programming algorithm rather than an oracle planner. The result is a simple exploration strategy that visits states relevant to tightening the LP solution and achieves sample efficiency logarithmic in the size of the problem description. Our experimental results show that this targeted approach performs better than using approximate planning to implement either Factored E3 or Factored Rmax.
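
To make the planning component concrete, below is a minimal illustrative sketch of the approximate linear programming formulation for a generic MDP with a linear value-function approximation V(s) = sum_i w_i * phi_i(s). This is a toy on a small, enumerated state space, not the factored-MDP formulation or the exploration scheme developed in the paper; the function name alp_weights and all problem data (P, R, Phi, state_dist, the random example) are hypothetical placeholders chosen for illustration.

```python
# Illustrative sketch of approximate linear programming (ALP) for a flat MDP.
# Assumption: states and actions are small enough to enumerate all Bellman
# constraints explicitly (the factored case in the paper avoids this).
import numpy as np
from scipy.optimize import linprog

def alp_weights(P, R, Phi, gamma=0.95, state_dist=None):
    """Solve  min_w  d^T Phi w   s.t.  Phi w >= R_a + gamma * P_a Phi w  for all a.

    P:   (A, S, S) transition matrices      R: (A, S) rewards
    Phi: (S, K) basis-function matrix       state_dist: (S,) state-relevance weights
    Returns the basis weights w defining the approximate value function Phi w.
    """
    A, S, _ = P.shape
    K = Phi.shape[1]
    d = state_dist if state_dist is not None else np.ones(S) / S
    c = d @ Phi  # objective coefficients on w: minimize d^T (Phi w)
    # One block of Bellman constraints per action, rewritten as A_ub @ w <= b_ub.
    A_ub = np.vstack([-(Phi - gamma * P[a] @ Phi) for a in range(A)])
    b_ub = np.concatenate([-R[a] for a in range(A)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * K, method="highs")
    return res.x

# Tiny random example: 6 states, 2 actions, 3 basis functions (constant, linear, quadratic).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(6), size=(2, 6))   # (A, S, S), rows are distributions
R = rng.uniform(size=(2, 6))                 # (A, S)
Phi = np.column_stack([np.ones(6), np.arange(6), np.arange(6) ** 2.0])
w = alp_weights(P, R, Phi)
print("approximate value function:", Phi @ w)
```

In the factored setting the number of Bellman constraints is exponential in the number of state variables, so rather than enumerating them as above, structure in the basis functions and transition model is exploited to represent or generate the constraints compactly.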

[2]  R. L. Keeney,et al.  Decisions with Multiple Objectives: Preferences and Value Trade-Offs , 1977, IEEE Transactions on Systems, Man, and Cybernetics.

[3]  P. Schweitzer,et al.  Generalized polynomial approximations in Markovian decision processes , 1985 .

[4]  P. W. Jones,et al.  Bandit Problems, Sequential Allocation of Experiments , 1987 .

[5]  Keiji Kanazawa,et al.  A model for reasoning about persistence and causation , 1989 .

[7]  Martin L. Puterman,et al.  Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[8]  Andrew G. Barto,et al.  Improving Elevator Performance Using Reinforcement Learning , 1995, NIPS.

[9]  Craig Boutilier,et al.  Exploiting Structure in Policy Construction , 1995, IJCAI.

[10]  Benjamin Van Roy Learning and value function approximation in complex decision processes , 1998 .

[11]  Stuart J. Russell,et al.  Bayesian Q-Learning , 1998, AAAI/IAAI.

[12]  Daphne Koller,et al.  Computing Factored Value Functions for Policies in Structured MDPs , 1999, IJCAI.

[13]  Justin A. Boyan,et al.  Least-Squares Temporal Difference Learning , 1999, ICML.

[14]  Craig Boutilier,et al.  Decision-Theoretic Planning: Structural Assumptions and Computational Leverage , 1999, J. Artif. Intell. Res..

[15]  Michael Kearns,et al.  Efficient Reinforcement Learning in Factored MDPs , 1999, IJCAI.

[16]  Daphne Koller,et al.  Policy Iteration for Factored MDPs , 2000, UAI.

[17]  Judy Goldsmith,et al.  Nonapproximability Results for Partially Observable Markov Decision Processes , 2001, J. Artif. Intell. Res..

[18]  Eric Allender,et al.  Complexity of finite-horizon Markov decision process problems , 2000, JACM.

[19]  Carlos Guestrin,et al.  Max-norm Projections for Factored MDPs , 2001, IJCAI.

[20]  Dale Schuurmans,et al.  Direct value-approximation for factored MDPs , 2001, NIPS.

[21]  Carlos Guestrin,et al.  Multiagent Planning with Factored MDPs , 2001, NIPS.

[22]  Timothy X. Brown,et al.  Switch Packet Arbitration via Queue-Learning , 2001, NIPS.

[23]  Ronen I. Brafman,et al.  R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..

[24]  Benjamin Van Roy,et al.  The Linear Programming Approach to Approximate Dynamic Programming , 2003, Oper. Res..

[25]  Steven J. Bradtke,et al.  Linear Least-Squares algorithms for temporal difference learning , 1996, Machine Learning.

[26]  Michael Kearns,et al.  Near-Optimal Reinforcement Learning in Polynomial Time , 2002, Machine Learning.