Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL

We study reinforcement learning (RL) in episodic, factored Markov decision processes (FMDPs). We propose an algorithm, FMDP-BF, that exploits the factored structure of the FMDP. The regret of FMDP-BF is shown to be exponentially smaller than that of optimal algorithms designed for non-factored MDPs, and it improves on the best previous result for FMDPs~\citep{osband2014near} by a factor of $\sqrt{nH|\mathcal{S}_i|}$, where $|\mathcal{S}_i|$ is the cardinality of the $i$-th factored state subspace, $H$ is the planning horizon, and $n$ is the number of factored transition components. To show the optimality of our bounds, we also provide a lower bound for FMDPs, which indicates that our algorithm is near-optimal with respect to the number of timesteps $T$, the horizon $H$, and the factored state-action subspace cardinality. Finally, as an application, we study a new formulation of constrained RL, known as RL with knapsack constraints (RLwK), and provide the first sample-efficient algorithm for it, based on FMDP-BF.
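The abstract's regret improvement hinges on the transition model factoring into $n$ components, each depending only on a small subset ("scope") of state-action variables. The sketch below illustrates that structure; all names (`make_factored_transition`, `scopes`, `factor_dists`) are illustrative assumptions, not notation from the paper:

```python
import itertools
import random

def make_factored_transition(scopes, factor_dists):
    """Build a sampler for a factored transition model.

    scopes[i]: indices of the (state + action) variables that component i reads.
    factor_dists[i]: dict mapping the scoped value tuple to a list of
        (next_value, probability) pairs for component i.

    Because each component is learned over its own small subspace S_i rather
    than the full product space, estimation error scales with |S_i|, which is
    the source of the exponential gap versus non-factored algorithms.
    """
    def step(state, action, rng=random):
        sa = tuple(state) + tuple(action)
        next_state = []
        for scope, dist in zip(scopes, factor_dists):
            key = tuple(sa[j] for j in scope)
            values, probs = zip(*dist[key])
            next_state.append(rng.choices(values, weights=probs, k=1)[0])
        return tuple(next_state)
    return step

# Toy example: 2 binary state variables, 1 binary action variable.
# Component 0 depends only on variable 0; component 1 on variables (1, 2).
scopes = [(0,), (1, 2)]
factor_dists = [
    {(0,): [(0, 0.9), (1, 0.1)], (1,): [(0, 0.2), (1, 0.8)]},
    # Component 1 is deterministic: XOR of its two scoped variables.
    {key: [(key[0] ^ key[1], 1.0)] for key in itertools.product((0, 1), repeat=2)},
]
step = make_factored_transition(scopes, factor_dists)
s_next = step(state=(1, 0), action=(1,))
```

Here component 1's next value is always `0 ^ 1 = 1` for this state-action pair, while component 0 is sampled from its own two-entry conditional table, independently of the other variables.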

[1] Nikhil R. Devanur, et al. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. COLT, 2015.

[2] Mengdi Wang, et al. Model-Based Reinforcement Learning with Value-Targeted Regression. L4DC, 2020.

[3] Rémi Munos, et al. Minimax Regret Bounds for Reinforcement Learning. ICML, 2017.

[4] Craig Boutilier, et al. Stochastic dynamic programming with factored representations. Artif. Intell., 2000.

[5] Aleksandrs Slivkins, et al. Constrained episodic reinforcement learning in concave-convex and knapsack settings. NeurIPS, 2020.

[6] Nicholas Roy, et al. Provably Efficient Learning with Typed Parametric Models. J. Mach. Learn. Res., 2009.

[7] Tor Lattimore, et al. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. NIPS, 2017.

[8] Lihong Li, et al. Policy Certificates: Towards Accountable Reinforcement Learning. ICML, 2018.

[9] Zhuoran Yang, et al. Provably Efficient Safe Exploration via Primal-Dual Policy Optimization. AISTATS, 2020.

[10] Xiaoyu Chen, et al. Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP. ICLR, 2019.

[11] Shobha Venkataraman, et al. Efficient Solution Algorithms for Factored MDPs. J. Artif. Intell. Res., 2003.

[12] Nan Jiang, et al. Contextual Decision Processes with low Bellman rank are PAC-Learnable. ICML, 2016.

[13] Michael L. Littman, et al. A unifying framework for computational reinforcement learning theory. 2009.

[14] Benjamin Van Roy, et al. Information-Theoretic Confidence Bounds for Reinforcement Learning. NeurIPS, 2019.

[15] Benjamin Van Roy, et al. Eluder Dimension and the Sample Complexity of Optimistic Exploration. NIPS, 2013.

[16] Lihong Li, et al. Reinforcement Learning in Finite MDPs: PAC Analysis. J. Mach. Learn. Res., 2009.

[17] Suvrit Sra, et al. Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes. NeurIPS, 2020.

[18] Ruosong Wang, et al. Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension. NeurIPS, 2020.

[19] Mengdi Wang, et al. Sample-Optimal Parametric Q-Learning Using Linearly Additive Features. ICML, 2019.

[20] Emma Brunskill, et al. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds. ICML, 2019.

[21] Alessandro Lazaric, et al. Learning Near Optimal Policies with Low Inherent Bellman Error. ICML, 2020.

[22] Lillian J. Ratliff, et al. Constrained Upper Confidence Reinforcement Learning. L4DC, 2020.

[23] Robert Kleinberg, et al. Bandits with Knapsacks. 2018.

[24] Shie Mannor, et al. Exploration-Exploitation in Constrained MDPs. arXiv, 2020.

[25] Peter Auer, et al. Near-optimal Regret Bounds for Reinforcement Learning. J. Mach. Learn. Res., 2008.

[26] Michael I. Jordan, et al. Is Q-learning Provably Efficient? NeurIPS, 2018.

[27] Michael I. Jordan, et al. Provably Efficient Reinforcement Learning with Linear Function Approximation. COLT, 2019.

[28] Michael Kearns, et al. Efficient Reinforcement Learning in Factored MDPs. IJCAI, 1999.

[29] Benjamin Van Roy, et al. Model-based Reinforcement Learning and the Eluder Dimension. NIPS, 2014.

[30] Benjamin Van Roy, et al. Near-optimal Reinforcement Learning in Factored MDPs. NIPS, 2014.

[31] Ness B. Shroff, et al. Learning in Markov Decision Processes under Constraints. arXiv, 2020.

[32] Paolo Toth, et al. Knapsack Problems: Algorithms and Computer Implementations. 1990.

[33] John E. Beasley. Multidimensional Knapsack Problems. Encyclopedia of Optimization, 2009.