Efficient Exploration and Value Function Generalization in Deterministic Systems

We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and propose, as a solution, optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q* lies within the hypothesis class Q, OCP selects optimal actions over all but at most dim_E(Q) episodes, where dim_E denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even when Q* does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets.
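As a rough illustration only: in the tabular special case, where the hypothesis class Q contains all bounded functions, OCP reduces to optimistic Q-iteration with exact Bellman equality constraints, since determinism makes every observed transition an equality constraint on Q*. The sketch below shows this specialization; the class and method names (OCPTabular, act, observe) are ours for illustration, not the paper's pseudocode, and the general algorithm instead maintains a constraint set over an arbitrary hypothesis class.

```python
import numpy as np

class OCPTabular:
    """Tabular specialization of optimistic constraint propagation (OCP)
    for a finite-horizon deterministic MDP. When the hypothesis class
    contains all bounded functions, the optimistic value consistent with
    the accumulated constraints is the propagated Bellman value wherever
    an (h, s, a) triple has been observed, and an upper bound q_max
    elsewhere. Illustrative sketch, not the paper's general algorithm."""

    def __init__(self, n_states, n_actions, horizon, q_max):
        self.H = horizon
        # Optimistic initialization: every unvisited (h, s, a) keeps q_max.
        self.Q = np.full((horizon, n_states, n_actions), float(q_max))

    def act(self, h, s):
        # Greedy in the optimistic Q; exploration is driven by optimism alone.
        return int(np.argmax(self.Q[h, s]))

    def observe(self, h, s, a, r, s_next):
        # Determinism turns each observation into an equality constraint,
        # Q(h, s, a) = r + max_a' Q(h+1, s_next, a'), which we propagate
        # directly; revisits tighten the value as downstream entries settle.
        future = 0.0 if h + 1 == self.H else float(np.max(self.Q[h + 1, s_next]))
        self.Q[h, s, a] = r + future


# Hypothetical driving loop, assuming an environment with reset()/step(a)
# returning states and rewards deterministically:
#
#   agent = OCPTabular(n_states=10, n_actions=4, horizon=5, q_max=5.0)
#   for episode in range(100):
#       s = env.reset()
#       for h in range(agent.H):
#           a = agent.act(h, s)
#           s_next, r = env.step(a)
#           agent.observe(h, s, a, r, s_next)
#           s = s_next
```

In the general setting, the table above is replaced by a growing set of equality constraints, and the optimistic value of each state-action pair is obtained by maximizing over functions in Q consistent with those constraints; the eluder dimension bounds how many episodes can yield a constraint that is not already implied by earlier ones.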
