Batch Value-function Approximation with Only Realizability

We make progress on a long-standing problem of batch reinforcement learning (RL): learning $Q^\star$ from an exploratory and polynomial-sized dataset, using a realizable and otherwise arbitrary function class. In fact, all existing algorithms demand function-approximation assumptions stronger than realizability, and the mounting negative evidence has led to a conjecture that sample-efficient learning is impossible in this setting (Chen and Jiang, 2019). Our algorithm, BVFT, breaks the hardness conjecture (albeit under a stronger notion of exploratory data) via a tournament procedure that reduces the learning problem to pairwise comparison, and solves the latter with the help of a state-action partition constructed from the compared functions. We also discuss how BVFT can be applied to model selection, among other extensions and open problems.
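
To make the tournament idea concrete, the following is a minimal Python sketch of how such a pairwise-comparison procedure might look for a finite candidate class, a discounted MDP, and a batch of $(s, a, r, s')$ transitions. The function name bvft_select, the data layout, and the discretization parameter resolution are illustrative assumptions rather than the paper's exact algorithm.

    import numpy as np

    def bvft_select(candidates, data, actions, gamma=0.99, resolution=0.1):
        """Sketch of a BVFT-style tournament over a finite candidate class.

        candidates: list of callables f(s, a) -> float (candidate Q-functions).
        data: list of (s, a, r, s_next) transitions from the exploratory batch.
        actions: finite action set used for the greedy Bellman backup.
        Returns the candidate whose worst pairwise loss is smallest.
        """
        K, N = len(candidates), len(data)

        # Candidate values and sampled Bellman optimality backups on the batch:
        # q[i, n] = f_i(s_n, a_n),  tq[i, n] = r_n + gamma * max_a' f_i(s'_n, a').
        q = np.array([[f(s, a) for (s, a, r, s2) in data] for f in candidates])
        tq = np.array([[r + gamma * max(f(s2, a2) for a2 in actions)
                        for (s, a, r, s2) in data] for f in candidates])

        losses = np.zeros((K, K))
        for i in range(K):
            for j in range(K):
                # Partition the batch by the joint (discretized) level sets of
                # f_i and f_j; this is the state-action partition built from
                # the pair of functions being compared.
                groups = {}
                for n in range(N):
                    key = (int(np.floor(q[i, n] / resolution)),
                           int(np.floor(q[j, n] / resolution)))
                    groups.setdefault(key, []).append(n)
                # Bellman error of f_i after projecting both sides onto the
                # piecewise-constant functions induced by that partition.
                err = 0.0
                for idx in groups.values():
                    idx = np.asarray(idx)
                    gap = q[i, idx].mean() - tq[i, idx].mean()
                    err += len(idx) * gap ** 2
                losses[i, j] = np.sqrt(err / N)

        # Tournament: score each candidate by its worst pairwise comparison
        # and return the one with the smallest such score.
        return candidates[int(np.argmin(losses.max(axis=1)))]

The detail the sketch tries to capture is that the projection class is rebuilt from each pair of compared functions (the state-action partition mentioned above), so that under realizability $Q^\star$ should survive every comparison while misspecified candidates accumulate projected Bellman error.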

[1] Csaba Szepesvári, et al. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, 2006, Machine Learning.

[2] Leemon C. Baird, et al. Residual Algorithms: Reinforcement Learning with Function Approximation, 1995, ICML.

[3] Sergey Levine, et al. Offline policy evaluation across representations with applications to educational games, 2014, AAMAS.

[4] Nan Jiang, et al. Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches, 2018, COLT.

[5] Nan Jiang, et al. Abstraction Selection in Model-based Reinforcement Learning, 2015, ICML.

[6] Jiawei Huang, et al. Minimax Confidence Interval for Off-Policy Evaluation and Policy Optimization, 2020, arXiv.

[7] Csaba Szepesvári, et al. Model Selection in Reinforcement Learning, 2011, Machine Learning.

[8] Shie Mannor, et al. Model selection in Markovian processes, 2013, KDD.

[9] Emma Brunskill, et al. Off-Policy Policy Gradient with State Distribution Correction, 2019, UAI.

[10] Nan Jiang, et al. On Value Functions and the Agent-Environment Boundary, 2019, arXiv.

[11] Csaba Szepesvári, et al. Finite-Time Bounds for Fitted Value Iteration, 2008, J. Mach. Learn. Res.

[12] Nan Jiang, et al. Information-Theoretic Considerations in Batch Reinforcement Learning, 2019, ICML.

[13] Thomas J. Walsh, et al. Towards a Unified Theory of State Abstraction for MDPs, 2006, AI&M.

[14] Masatoshi Uehara, et al. Minimax Weight and Q-Function Learning for Off-Policy Evaluation, 2019, ICML.

[15] Qiang Liu, et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation, 2018, NeurIPS.

[16] Csaba Szepesvári, et al. A Generalized Reinforcement-Learning Model: Convergence and Applications, 1996, ICML.

[17] Rémi Munos, et al. Error Bounds for Approximate Policy Iteration, 2003, ICML.

[18] Nan Jiang, et al. $Q^\star$ Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison, 2020, arXiv:2003.03924.

[19] Emma Brunskill, et al. Provably Good Batch Reinforcement Learning Without Great Exploration, 2020, arXiv.

[20] Doina Precup, et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.

[21] Nan Jiang, et al. Contextual Decision Processes with low Bellman rank are PAC-Learnable, 2016, ICML.

[22] Balaraman Ravindran, et al. Approximate Homomorphisms: A framework for non-exact minimization in Markov Decision Processes, 2004.

[23] Nando de Freitas, et al. Hyperparameter Selection for Offline Reinforcement Learning, 2020, arXiv.

[24] Doina Precup, et al. Off-Policy Deep Reinforcement Learning without Exploration, 2018, ICML.

[25] Luc Devroye, et al. Combinatorial Methods in Density Estimation, 2001, Springer Series in Statistics.

[26] Geoffrey J. Gordon. Stable Function Approximation in Dynamic Programming, 1995, ICML.

[27] Rémi Munos, et al. Performance Bounds in Lp-norm for Approximate Value Iteration, 2007, SIAM J. Control Optim.

[28] André da Motta Salles Barreto, et al. Policy Iteration Based on Stochastic Factorization, 2014, J. Artif. Intell. Res.

[29] Ward Whitt, et al. Approximations of Dynamic Programs, I, 1978, Math. Oper. Res.

[30] Shimon Whiteson, et al. Efficient Abstraction Selection in Reinforcement Learning, 2013, Comput. Intell.

[31] Yishay Mansour, et al. Approximate Equivalence of Markov Decision Processes, 2003, COLT.

[32] Michail G. Lagoudakis, et al. Least-Squares Policy Iteration, 2003, J. Mach. Learn. Res.

[33] Qiang Liu, et al. Accountable Off-Policy Evaluation With Kernel Bellman Statistics, 2020, ICML.

[34] Csaba Szepesvári, et al. Error Propagation for Approximate Policy and Value Iteration, 2010, NIPS.

[35] Nan Jiang, et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, 2015, ICML.

[36] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, MIT Press.