Towards Tight Bounds on the Sample Complexity of Average-reward MDPs

We prove new upper and lower bounds on the sample complexity of finding an ε-optimal policy of an infinite-horizon average-reward Markov decision process (MDP) given access to a generative model. When the mixing time of the probability transition matrix of all policies is at most t_mix, we provide an algorithm that solves the problem using Õ(t_mix ε^{-3}) (oblivious) samples per state-action pair. Further, we provide a lower bound showing that a linear dependence on t_mix is necessary in the worst case for any algorithm which uses oblivious samples. We obtain our results by establishing connections between infinite-horizon average-reward MDPs and discounted MDPs that may be of further utility.
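A rough sketch of how a bound of this form can arise from the discounted-MDP connection (the constant C, the particular choice of discount factor, and the Õ((1-γ)^{-3} ε'^{-2}) per-pair cost for the discounted subproblem are illustrative assumptions here, not the paper's exact statements):

% Sketch under assumed constants: reduce the average-reward problem to a
% discounted one with effective horizon of order t_mix / epsilon.
\begin{align*}
  &\text{For a mixing MDP, assume } \bigl|(1-\gamma)\,V_\gamma^{\pi}(s) - \rho^{\pi}\bigr|
      \;\le\; C\,(1-\gamma)\,t_{\mathrm{mix}} \quad \text{for every policy } \pi. \\
  &\text{Choose } 1-\gamma \;=\; \frac{\varepsilon}{4C\,t_{\mathrm{mix}}}
      \text{ and compute a policy } \pi \text{ that is } \varepsilon' = 2C\,t_{\mathrm{mix}}
      \text{-optimal in discounted value.} \\
  &\text{Then } \rho^{\pi} \;\ge\; \rho^{\star}
      \;-\; (1-\gamma)\,\varepsilon' \;-\; 2C\,(1-\gamma)\,t_{\mathrm{mix}}
      \;=\; \rho^{\star} - \tfrac{\varepsilon}{2} - \tfrac{\varepsilon}{2}
      \;=\; \rho^{\star} - \varepsilon, \\
  &\text{at an assumed per-pair sample cost of }
      \widetilde{O}\!\Bigl(\tfrac{1}{(1-\gamma)^{3}\,(\varepsilon')^{2}}\Bigr)
      \;=\; \widetilde{O}\!\Bigl(\tfrac{t_{\mathrm{mix}}^{3}}{\varepsilon^{3}}
            \cdot \tfrac{1}{t_{\mathrm{mix}}^{2}}\Bigr)
      \;=\; \widetilde{O}\!\Bigl(\tfrac{t_{\mathrm{mix}}}{\varepsilon^{3}}\Bigr).
\end{align*}

The point of the sketch is only that an effective horizon of order t_mix/ε, combined with a cubic-in-horizon discounted solver, recovers the Õ(t_mix ε^{-3}) scaling; the actual reduction and constants are as given in the paper.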
