Exploration in Structured Reinforcement Learning

We address reinforcement learning problems with finite state and action spaces in which the underlying MDP has some known structure that could potentially be exploited to minimize the exploration rates of suboptimal (state, action) pairs. For any arbitrary structure, we derive problem-specific regret lower bounds satisfied by any learning algorithm. These lower bounds are made explicit for unstructured MDPs and for those whose transition probabilities and average reward functions are Lipschitz continuous w.r.t. the state and action. For Lipschitz MDPs, the bounds are shown not to scale with the sizes $S$ and $A$ of the state and action spaces, i.e., they are smaller than $c\log T$, where $T$ is the time horizon and the constant $c$ depends only on the Lipschitz structure, the span of the bias function, and the minimal action sub-optimality gap. This contrasts with unstructured MDPs, where the regret lower bound typically scales as $SA\log T$. We devise DEL (Directed Exploration Learning), an algorithm that matches our regret lower bounds. We further simplify the algorithm for Lipschitz MDPs, and show that the simplified version can still efficiently exploit the structure.
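
To make the scaling claim concrete, problem-specific lower bounds of this kind are typically stated in a Graves–Lai form. The sketch below is illustrative and uses notation that may differ from the paper's exact statement: $\Phi$ denotes the known structure, $\phi$ the true MDP, $\delta^*(s,a)$ the sub-optimality gap of action $a$ in state $s$, $\eta(s,a)$ the exploration rate of pair $(s,a)$, $\mathrm{KL}_{\phi|\psi}(s,a)$ the KL divergence between the transition-reward distributions of $\phi$ and $\psi$ at $(s,a)$, and $\Lambda_\Phi(\phi)$ the set of confusing MDPs in $\Phi$ that agree with $\phi$ on optimal pairs but have a different optimal policy.

$$\liminf_{T\to\infty} \frac{R^\pi(T)}{\log T} \;\ge\; K_\Phi(\phi), \qquad K_\Phi(\phi) \;=\; \min_{\eta \ge 0} \Big\{ \sum_{(s,a)} \eta(s,a)\,\delta^*(s,a) \;:\; \sum_{(s,a)} \eta(s,a)\,\mathrm{KL}_{\phi|\psi}(s,a) \ge 1 \;\; \forall \psi \in \Lambda_\Phi(\phi) \Big\}.$$

Under a Lipschitz structure, each constraint couples the exploration rates of nearby (state, action) pairs, which is how the value $K_\Phi(\phi)$, and hence the $c\log T$ bound, can remain bounded independently of $S$ and $A$; without structure the constraints decouple and the optimum typically involves all $SA$ pairs, recovering the $SA\log T$ scaling.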
