Efficient Value-Function Approximation via Online Linear Regression

One of the key problems in reinforcement learning (RL) is balancing exploration and exploitation. Another is learning and acting in large or even continuous Markov decision processes (MDPs), where compact function approximation must be used. In this paper, we provide a provably efficient, model-free RL algorithm for finite-horizon problems with linear value-function approximation that addresses the exploration-exploitation tradeoff in a principled way. The key element of the algorithm is a hypothesized online linear-regression subroutine operating in the recently proposed KWIK ("knows what it knows") framework. We show that if the sample complexity of the KWIK online linear-regression algorithm is polynomial, then the sample complexity of exploration of the RL algorithm is also polynomial. This connection provides a promising approach to efficient RL with function approximation by studying the simpler problem of online linear regression.
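To make the abstraction concrete, here is a minimal sketch of what a KWIK online linear regressor could look like: on each query it either returns a prediction it believes is accurate or abstains with "I don't know," and it only receives the true label after abstaining. The class name `KWIKLinearRegressor`, the ridge-style update, and the elliptic-norm abstention rule are illustrative assumptions for this sketch, not the exact algorithm analyzed in the paper.

```python
import numpy as np

class KWIKLinearRegressor:
    """Illustrative KWIK-style online linear regressor (a sketch).

    predict(x) returns either a real-valued prediction or None,
    where None plays the role of the KWIK "I don't know" output.
    The KWIK sample complexity is the number of abstentions.
    """

    def __init__(self, dim, alpha=1.0, threshold=0.2):
        # (X^T X + alpha * I)^{-1}, maintained incrementally.
        self.A_inv = np.eye(dim) / alpha
        # Running sum X^T y.
        self.b = np.zeros(dim)
        # Abstain when the input's uncertainty exceeds this level
        # (an assumed tuning parameter for this sketch).
        self.threshold = threshold

    def predict(self, x):
        # Uncertainty of x relative to the data seen so far,
        # measured in the elliptic norm induced by A_inv.
        uncertainty = np.sqrt(x @ self.A_inv @ x)
        if uncertainty > self.threshold:
            return None  # "I don't know" -- a KWIK abstention
        theta = self.A_inv @ self.b  # ridge-regression weights
        return float(theta @ x)

    def update(self, x, y):
        # Incorporate a labeled example via a Sherman-Morrison
        # rank-one update of the inverse covariance.
        Ax = self.A_inv @ x
        self.A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        self.b += y * x
```

An RL agent built on such a subroutine would treat abstentions optimistically: a state-action pair whose value the regressor declines to predict is considered unknown and therefore worth exploring, and the observed target is fed back via `update` once the pair is visited.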
