Reducing reinforcement learning to KWIK online regression

One of the key problems in reinforcement learning (RL) is balancing exploration and exploitation. Another is learning and acting in large Markov decision processes (MDPs), where compact function approximation must be used. This paper introduces REKWIRE, a provably efficient, model-free algorithm for finite-horizon RL problems with value-function approximation (VFA) that addresses the exploration-exploitation tradeoff in a principled way. The central element of the algorithm is a reduction of RL to online regression in the recently proposed KWIK ("Knows What It Knows") learning model. We show that if the KWIK online regression problem can be solved efficiently, then the sample complexity of exploration of REKWIRE is polynomial. The reduction therefore suggests a new and sound direction for tackling general RL problems. The efficiency of the algorithm is verified in a set of proof-of-concept experiments where popular, ad hoc exploration approaches fail.
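To make the protocol concrete, the following is a minimal Python sketch of the KWIK online-regression interface that the reduction relies on. The names here (KWIKRegressor, predict, update, and the memorizing toy learner) are illustrative assumptions, not identifiers from the paper: the defining property of a KWIK learner is that it either returns an accurate prediction or explicitly answers "I don't know," and the total number of "I don't know" answers is polynomially bounded.

```python
# Sketch of the KWIK ("Knows What It Knows") online-regression protocol.
# A KWIK learner must either return a prediction that is epsilon-accurate
# (with probability at least 1 - delta) or admit ignorance by returning
# None, and the total number of None answers must be polynomially bounded.

from abc import ABC, abstractmethod
from typing import Optional, Sequence


class KWIKRegressor(ABC):
    """Hypothetical interface for a KWIK online regression learner."""

    @abstractmethod
    def predict(self, x: Sequence[float]) -> Optional[float]:
        """Return an accurate prediction, or None for 'I don't know'."""

    @abstractmethod
    def update(self, x: Sequence[float], y: float) -> None:
        """Observe a (possibly noisy) label for x after a None answer."""


class MemorizingKWIK(KWIKRegressor):
    """Toy deterministic example: exact memorization over a finite input set.

    It knows the answer only for inputs it has already seen, so the number
    of 'I don't know' responses is at most the number of distinct inputs.
    """

    def __init__(self) -> None:
        self.table: dict[tuple, float] = {}

    def predict(self, x: Sequence[float]) -> Optional[float]:
        return self.table.get(tuple(x))  # None signals "I don't know"

    def update(self, x: Sequence[float], y: float) -> None:
        self.table[tuple(x)] = y
```

Roughly, the reduction maintains one such learner per horizon step to approximate the optimal value function; whenever a learner answers "I don't know" for a state-action pair, the agent can treat that pair optimistically, which steers exploration toward the parts of the MDP that are not yet well understood.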
