Planning to the Information Horizon of BAMDPs via Epistemic State Abstraction

The Bayes-Adaptive Markov Decision Process (BAMDP) formalism pursues the Bayes-optimal solution to the exploration-exploitation trade-off in reinforcement learning. Since computing exact solutions to Bayesian reinforcement-learning problems is intractable, much of the literature has focused on developing suitable approximation algorithms. In this work, before turning to algorithm design, we first define, under mild structural assumptions, a complexity measure for BAMDP planning. Because efficient exploration in BAMDPs hinges on the judicious acquisition of information, our complexity measure captures the worst-case difficulty of gathering information and exhausting epistemic uncertainty. To illustrate its significance, we present a computationally intractable, exact planning algorithm that leverages this measure to plan more efficiently. We conclude by introducing a specific form of state abstraction that has the potential to reduce BAMDP complexity and gives rise to a computationally tractable, approximate planning algorithm.
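To make the hyper-state view concrete, the sketch below is a toy illustration rather than the paper's algorithm: it assumes a small tabular MDP with known rewards and Dirichlet posteriors over transitions, plans over BAMDP hyper-states (physical state, posterior counts), and caches values on a crude epistemic abstraction that buckets similar posteriors together. The names bayes_value, abstract_key, and the resolution parameter are hypothetical and introduced only for this example.

```python
# Illustrative sketch only: finite-horizon Bayes-adaptive lookahead over
# hyper-states (physical state, Dirichlet posterior counts), with values
# memoised on a coarse "epistemic abstraction" of the posterior.
import numpy as np

n_states, n_actions, horizon = 3, 2, 4
rewards = np.random.RandomState(0).rand(n_states, n_actions)  # known mean rewards

def posterior_mean(counts):
    """Posterior-mean transition probabilities under Dirichlet counts."""
    return counts / counts.sum(axis=-1, keepdims=True)

def abstract_key(state, counts, resolution=0.25):
    """Hypothetical epistemic abstraction: bucket the posterior mean onto a
    coarse grid so hyper-states with similar beliefs share one abstract state."""
    grid = np.round(posterior_mean(counts) / resolution).astype(int)
    return (state, tuple(grid.ravel()))

def bayes_value(state, counts, depth, cache):
    """Bayes-adaptive value by exhaustive lookahead; the cache is keyed on the
    abstract hyper-state, so distinct beliefs may (approximately) share values."""
    if depth == 0:
        return 0.0
    key = (abstract_key(state, counts), depth)
    if key in cache:
        return cache[key]
    probs = posterior_mean(counts)
    best = -np.inf
    for a in range(n_actions):
        q = rewards[state, a]
        for s_next in range(n_states):
            p = probs[state, a, s_next]
            if p < 1e-8:
                continue
            nxt = counts.copy()
            nxt[state, a, s_next] += 1.0  # imagined posterior update for this outcome
            q += p * bayes_value(s_next, nxt, depth - 1, cache)
        best = max(best, q)
    cache[key] = best
    return best

prior = np.ones((n_states, n_actions, n_states))  # uniform Dirichlet prior
print(bayes_value(0, prior, horizon, cache={}))
```

In this toy, coarsening the abstraction (a larger resolution) trades planning accuracy for fewer distinct hyper-states, which is the flavour of trade-off an epistemic state abstraction is meant to control.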
