Ruosong Wang | Akshay Krishnamurthy | Yining Wang | Simon S. Du
[1] Tom Schaul, et al. Unifying Count-Based Exploration and Intrinsic Motivation, 2016, NIPS.
[2] Michael I. Jordan, et al. Is Q-learning Provably Efficient?, 2018, NeurIPS.
[3] Nan Jiang, et al. Contextual Decision Processes with low Bellman rank are PAC-Learnable, 2016, ICML.
[4] T. Lai, et al. Self-Normalized Processes: Limit Theory and Statistical Applications, 2001.
[5] Csaba Szepesvári, et al. Error Propagation for Approximate Policy and Value Iteration, 2010, NIPS.
[6] Ronen I. Brafman, et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res.
[7] Doina Precup, et al. Off-Policy Deep Reinforcement Learning without Exploration, 2018, ICML.
[8] Csaba Szepesvári, et al. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, 2006, Machine Learning.
[9] Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 2002, Machine Learning.
[10] Benjamin Van Roy, et al. Deep Exploration via Bootstrapped DQN, 2016, NIPS.
[11] Michael I. Jordan, et al. Provably Efficient Reinforcement Learning with Linear Function Approximation, 2019, COLT.
[12] Mengdi Wang, et al. Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound, 2019, ICML.
[13] Nan Jiang, et al. Information-Theoretic Considerations in Batch Reinforcement Learning, 2019, ICML.
[14] Csaba Szepesvári, et al. Finite-Time Bounds for Fitted Value Iteration, 2008, J. Mach. Learn. Res.
[15] Nan Jiang, et al. Provably efficient RL with Rich Observations via Latent State Decoding, 2019, ICML.
[16] Filip De Turck, et al. #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning, 2016, NIPS.
[17] Rémi Munos, et al. Error Bounds for Approximate Policy Iteration, 2003, ICML.
[18] Francisco S. Melo, et al. Q-Learning with Linear Function Approximation, 2007, COLT.
[19] John Langford, et al. PAC Reinforcement Learning with Rich Observations, 2016, NIPS.
[20] Tor Lattimore, et al. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning, 2017, NIPS.
[21] Andrew G. Barto, et al. Linear Least-Squares Algorithms for Temporal Difference Learning, 2005, Machine Learning.
[22] Haipeng Luo, et al. Practical Contextual Bandits with Regression Oracles, 2018, ICML.
[23] Lihong Li, et al. PAC model-free reinforcement learning, 2006, ICML.
[24] Csaba Szepesvári, et al. Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits, 2012, AISTATS.
[25] Ruosong Wang, et al. Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?, 2020, ICLR.
[26] Csaba Szepesvári, et al. Improved Algorithms for Linear Stochastic Bandits, 2011, NIPS.
[27] Rémi Munos, et al. Minimax Regret Bounds for Reinforcement Learning, 2017, ICML.
[28] Max Simchowitz, et al. Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs, 2019, NeurIPS.
[29] Lihong Li, et al. Provably Optimal Algorithms for Generalized Linear Contextual Bandits, 2017, arXiv.
[30] Aurélien Garivier, et al. Parametric Bandits: The Generalized Linear Case, 2010, NIPS.