Near-optimal Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms for the Non-episodic Setting

We study reinforcement learning in factored Markov decision processes (FMDPs) in the non-episodic setting. We focus on regret analyses, providing both upper and lower bounds. We propose two near-optimal and oracle-efficient algorithms for FMDPs. Assuming oracle access to an FMDP planner, they enjoy a Bayesian and a frequentist regret bound, respectively, both of which reduce to the near-optimal bound $\widetilde{O}(DS\sqrt{AT})$ for standard non-factored MDPs. Our lower bound depends on the span of the bias vector rather than the diameter $D$, and we show via a simple Cartesian product construction that FMDPs with a bounded span can have an arbitrarily large diameter, which suggests that bounds depending on the diameter can be extremely loose. We therefore propose another algorithm that depends only on the span but relies on a computationally stronger oracle. Our algorithms outperform the previous near-optimal algorithms in computer network administrator simulations.
