Upper Confidence Reinforcement Learning exploiting state-action equivalence

Leveraging an equivalence property on the set of states or state-action pairs of a Markov Decision Process (MDP) has been suggested by many authors. We take the study of equivalence classes to the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known, in a discrete MDP with the average-reward criterion and no reset. We study a notion of similarity between state-action pairs related to optimal transport. We first analyze a variant of the UCRL2 algorithm called C-UCRL2, which highlights the clear benefit of leveraging this equivalence structure when it is known ahead of time: the regret bound scales as Õ(D√(KCT)), where D is the diameter of the MDP, C is the number of classes of equivalent state-action pairs, and K bounds the support size of the transition distributions. A non-trivial question is whether this benefit can still be observed when the structure is unknown and must be learned while minimizing the regret. We propose a sound clustering technique that provably learns the unknown classes, but show that its natural combination with UCRL2 empirically fails. Our findings suggest this is due to the ad-hoc criterion used by UCRL2 to stop episodes. We replace it with hypothesis testing, which in turn considerably improves all strategies. We then validate empirically that learning the structure can be beneficial in a full-blown RL problem.
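
To make the two main ingredients concrete, the Python sketch below (not taken from the paper; all function and variable names are illustrative assumptions) shows (i) how transition counts can be pooled across the state-action pairs of a known equivalence class, which shrinks the L1 confidence radius of the pooled estimate roughly by the square root of the class size, and (ii) a generic deviation-test stopping rule of the kind that could replace UCRL2's count-doubling criterion for ending an episode.

import numpy as np


def pooled_transition_estimates(counts, classes, delta=0.05):
    """Pool transition counts over each (known) equivalence class.

    counts  : array of shape (S, A, S) with observed counts n(s, a, s').
    classes : partition of the state-action space, given as a list of
              lists of (s, a) pairs assumed to share the same transition
              law (here up to the identity relabelling of next states).
    delta   : confidence level of the L1 deviation bound.

    Returns {class index: (pooled empirical distribution, L1 radius)}.
    The radius is of order sqrt(S log(1/delta) / N) with N the pooled
    sample size, so pooling a class of size m divides it by about sqrt(m).
    """
    S = counts.shape[-1]
    out = {}
    for c_idx, pairs in enumerate(classes):
        pooled = np.zeros(S)
        for (s, a) in pairs:
            pooled += counts[s, a]
        n = max(pooled.sum(), 1.0)
        p_hat = pooled / n
        # Weissman-style L1 deviation bound on the pooled estimate.
        radius = np.sqrt(2.0 * S * np.log(2.0 / delta) / n)
        out[c_idx] = (p_hat, radius)
    return out


def episode_should_stop(p_hat_planning, counts_since_start, delta=0.05):
    """Illustrative hypothesis-test stopping rule (not the paper's exact test).

    p_hat_planning     : length-S transition estimate the current optimistic
                         policy was planned against.
    counts_since_start : length-S next-state counts for the same state-action
                         pair (or pooled class) collected during this episode.

    The episode ends as soon as the new samples are no longer consistent,
    at level delta, with the planning-time estimate.
    """
    n = counts_since_start.sum()
    if n == 0:
        return False
    p_hat_new = counts_since_start / n
    S = len(p_hat_new)
    threshold = np.sqrt(2.0 * S * np.log(2.0 / delta) / n)
    return float(np.abs(p_hat_new - p_hat_planning).sum()) > threshold

In this sketch equivalent pairs are assumed to share exactly the same transition law; the paper's notion of equivalence allows a relabelling of next states (a profile mapping related to optimal transport), which would be applied before pooling.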
