REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs

We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes, picking a policy in each episode via regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of Õ(HS√AT). We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.
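
As a concrete illustration of the episodic, span-regularized policy choice the abstract describes, one natural form (a sketch under assumed notation, not a verbatim statement of the algorithm) is: at the start of each episode $k$, given a confidence set $\mathcal{M}_k$ of MDPs consistent with the observations so far, select

$$
\widetilde{M}_k \in \operatorname*{arg\,max}_{M \in \mathcal{M}_k} \Big\{ \lambda^*(M) \;-\; C \cdot \mathrm{sp}\big(h^*(M)\big) \Big\},
$$

where $\lambda^*(M)$ is the optimal average reward of $M$, $h^*(M)$ its optimal bias vector, $\mathrm{sp}(v) = \max_s v(s) - \min_s v(s)$ the span, and $C > 0$ a regularization coefficient; the algorithm would then follow an optimal policy of $\widetilde{M}_k$ for the duration of the episode. The confidence set $\mathcal{M}_k$ and the coefficient $C$ are notation introduced here for illustration and are not quoted from this section.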