Efficient Exploration and Value Function Generalization in Deterministic Systems

We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and propose, as a solution, optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q* lies within the hypothesis class Q, OCP selects optimal actions over all but at most dim_E(Q) episodes, where dim_E denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even when Q* does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets.
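As a rough illustration only: in the tabular special case, where the hypothesis class Q contains all bounded functions, OCP reduces to optimistic Q-iteration with exact Bellman equality constraints, since determinism makes every observed transition an equality constraint on Q*. The sketch below shows this specialization; the class and method names (OCPTabular, act, observe) are ours for illustration, not the paper's pseudocode, and the general algorithm instead maintains a constraint set over an arbitrary hypothesis class.

```python
import numpy as np

class OCPTabular:
    """Tabular specialization of optimistic constraint propagation (OCP)
    for a finite-horizon deterministic MDP. When the hypothesis class
    contains all bounded functions, the optimistic value consistent with
    the accumulated constraints is the propagated Bellman value wherever
    an (h, s, a) triple has been observed, and an upper bound q_max
    elsewhere. Illustrative sketch, not the paper's general algorithm."""

    def __init__(self, n_states, n_actions, horizon, q_max):
        self.H = horizon
        # Optimistic initialization: every unvisited (h, s, a) keeps q_max.
        self.Q = np.full((horizon, n_states, n_actions), float(q_max))

    def act(self, h, s):
        # Greedy in the optimistic Q; exploration is driven by optimism alone.
        return int(np.argmax(self.Q[h, s]))

    def observe(self, h, s, a, r, s_next):
        # Determinism turns each observation into an equality constraint,
        # Q(h, s, a) = r + max_a' Q(h+1, s_next, a'), which we propagate
        # directly; revisits tighten the value as downstream entries settle.
        future = 0.0 if h + 1 == self.H else float(np.max(self.Q[h + 1, s_next]))
        self.Q[h, s, a] = r + future


# Hypothetical driving loop, assuming an environment with reset()/step(a)
# returning states and rewards deterministically:
#
#   agent = OCPTabular(n_states=10, n_actions=4, horizon=5, q_max=5.0)
#   for episode in range(100):
#       s = env.reset()
#       for h in range(agent.H):
#           a = agent.act(h, s)
#           s_next, r = env.step(a)
#           agent.observe(h, s, a, r, s_next)
#           s = s_next
```

In the general setting, the table above is replaced by a growing set of equality constraints, and the optimistic value of each state-action pair is obtained by maximizing over functions in Q consistent with those constraints; the eluder dimension bounds how many episodes can yield a constraint that is not already implied by earlier ones.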
