
"How hard is my MDP?" The distribution-norm to the rescue

In Reinforcement Learning (RL), state-of-the-art algorithms require a large number of samples per state-action pair to estimate the transition kernel p. In many problems, a good approximation of p is not needed. For instance, if from a state-action pair (s, a) one can only transition to states with the same value, learning p(·|s, a) accurately is irrelevant (only its support matters). This paper aims to capture such behavior by defining a novel hardness measure for Markov Decision Processes (MDPs) based on what we call the distribution-norm. The distribution-norm w.r.t. a measure v is defined on zero v-mean functions f as the standard deviation of f with respect to v. We first provide a concentration inequality for the dual of the distribution-norm. This allows us to replace the problem-independent, loose ‖·‖₁ concentration inequalities used in most previous analyses of RL algorithms with a tighter, problem-dependent hardness measure. We then show that several common RL benchmarks have low hardness when measured using the new norm. The distribution-norm captures finer properties than the number of states or the diameter and can be used to assess the difficulty of MDPs.
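The definition in the abstract can be illustrated with a minimal sketch. It assumes a finite state space and takes the distribution-norm of a function f under a next-state distribution p to be the standard deviation of f under p, as the abstract describes. The function name `distribution_norm`, the four-state value vector, and both distributions are hypothetical and not taken from the paper.

```python
import math

def distribution_norm(f, p):
    """Standard deviation of f under the probability vector p (illustrative)."""
    mean = sum(pi * fi for pi, fi in zip(p, f))
    var = sum(pi * (fi - mean) ** 2 for pi, fi in zip(p, f))
    return math.sqrt(var)

# Hypothetical value function over four states.
values = [1.0, 1.0, 5.0, 9.0]

# Hypothetical next-state distributions for two state-action pairs.
p_same_value = [0.5, 0.5, 0.0, 0.0]    # only reaches states whose value is 1.0
p_spread = [0.25, 0.25, 0.25, 0.25]    # reaches states with different values

norm_same = distribution_norm(values, p_same_value)
norm_spread = distribution_norm(values, p_spread)

# Range of the value function, a cruder problem-independent scale.
span = max(values) - min(values)

print(norm_same, norm_spread, span)
```

In this toy case the norm is zero for the pair that only reaches equal-valued states, matching the abstract's remark that only the support of p(·|s, a) matters there, while the uniform pair gives a norm that is still smaller than the range of the value function.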

Shie Mannor | Timothy A. Mann | Odalric-Ambrym Maillard
