Planning to the Information Horizon of BAMDPs via Epistemic State Abstraction

The Bayes-Adaptive Markov Decision Process (BAMDP) formalism pursues the Bayes-optimal solution to the exploration-exploitation trade-off in reinforcement learning. Since computing exact solutions to Bayesian reinforcement-learning problems is intractable, much of the literature has focused on developing suitable approximation algorithms. In this work, before turning to algorithm design, we first define, under mild structural assumptions, a complexity measure for BAMDP planning. Because efficient exploration in BAMDPs hinges on the judicious acquisition of information, our complexity measure captures the worst-case difficulty of gathering information and exhausting epistemic uncertainty. To illustrate its significance, we present a computationally intractable, exact planning algorithm that leverages this measure to plan more efficiently. We conclude by introducing a specific form of state abstraction that has the potential to reduce BAMDP complexity and gives rise to a computationally tractable, approximate planning algorithm.
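To make the hyper-state view concrete, the sketch below is a toy illustration rather than the paper's algorithm: it assumes a small tabular MDP with known rewards and Dirichlet posteriors over transitions, plans over BAMDP hyper-states (physical state, posterior counts), and caches values on a crude epistemic abstraction that buckets similar posteriors together. The names bayes_value, abstract_key, and the resolution parameter are hypothetical and introduced only for this example.

```python
# Illustrative sketch only: finite-horizon Bayes-adaptive lookahead over
# hyper-states (physical state, Dirichlet posterior counts), with values
# memoised on a coarse "epistemic abstraction" of the posterior.
import numpy as np

n_states, n_actions, horizon = 3, 2, 4
rewards = np.random.RandomState(0).rand(n_states, n_actions)  # known mean rewards

def posterior_mean(counts):
    """Posterior-mean transition probabilities under Dirichlet counts."""
    return counts / counts.sum(axis=-1, keepdims=True)

def abstract_key(state, counts, resolution=0.25):
    """Hypothetical epistemic abstraction: bucket the posterior mean onto a
    coarse grid so hyper-states with similar beliefs share one abstract state."""
    grid = np.round(posterior_mean(counts) / resolution).astype(int)
    return (state, tuple(grid.ravel()))

def bayes_value(state, counts, depth, cache):
    """Bayes-adaptive value by exhaustive lookahead; the cache is keyed on the
    abstract hyper-state, so distinct beliefs may (approximately) share values."""
    if depth == 0:
        return 0.0
    key = (abstract_key(state, counts), depth)
    if key in cache:
        return cache[key]
    probs = posterior_mean(counts)
    best = -np.inf
    for a in range(n_actions):
        q = rewards[state, a]
        for s_next in range(n_states):
            p = probs[state, a, s_next]
            if p < 1e-8:
                continue
            nxt = counts.copy()
            nxt[state, a, s_next] += 1.0  # imagined posterior update for this outcome
            q += p * bayes_value(s_next, nxt, depth - 1, cache)
        best = max(best, q)
    cache[key] = best
    return best

prior = np.ones((n_states, n_actions, n_states))  # uniform Dirichlet prior
print(bayes_value(0, prior, horizon, cache={}))
```

In this toy, coarsening the abstraction (a larger resolution) trades planning accuracy for fewer distinct hyper-states, which is the flavour of trade-off an epistemic state abstraction is meant to control.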
