Near-optimal Policy Identification in Active Reinforcement Learning

Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition and large state spaces. When the transition dynamics can be readily evaluated at specified states (e.g., via a simulator), the agent operates in the setting commonly referred to as planning with a \emph{generative model}. We propose the AE-LSVI algorithm for best-policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy \emph{uniformly} over the entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.
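For intuition only, below is a minimal, hypothetical sketch of the optimism-plus-pessimism acquisition idea in one dimension. It is not the paper's AE-LSVI implementation: the Gaussian-process surrogate, the confidence multiplier `beta`, and the toy objective are all assumptions made for illustration. The sketch fits a GP to a few noisy value observations and then queries the generative model at the point where the gap between the optimistic (upper) and pessimistic (lower) confidence bounds is largest.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Toy 1-D candidate set and a handful of noisy value observations
# (stand-ins for state-action queries answered by a generative model).
candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
X_obs = rng.uniform(0.0, 1.0, size=(8, 1))
y_obs = np.sin(6.0 * X_obs).ravel() + 0.05 * rng.standard_normal(8)

# GP surrogate for the unknown value function (assumed kernel and noise level).
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-3)
gp.fit(X_obs, y_obs)

mean, std = gp.predict(candidates, return_std=True)
beta = 2.0                      # confidence-width multiplier (assumption)
upper = mean + beta * std       # optimistic estimate
lower = mean - beta * std       # pessimistic estimate

# Active-exploration acquisition: query where the optimism/pessimism gap is widest,
# i.e., where the value estimate is most uncertain.
next_query = candidates[np.argmax(upper - lower)]
print("next point to query:", next_query)
```

In AE-LSVI the analogous gap is computed between optimistic and pessimistic value functions obtained from kernelized LSVI; the toy GP above merely illustrates why maximizing this gap directs queries toward the states whose values are least well determined.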
