Safe Exploration for Optimization with Gaussian Processes

We consider sequential decision problems under uncertainty, where we seek to optimize an unknown function from noisy samples. This requires balancing exploration (learning about the objective) and exploitation (localizing the maximum), a problem well studied in the multi-armed bandit literature. In many applications, however, we additionally require that the sampled function values exceed some prespecified "safety" threshold, a requirement that existing algorithms fail to meet. Examples include medical applications where patient comfort must be guaranteed, recommender systems aiming to avoid user dissatisfaction, and robotic control, where one seeks to avoid controls that cause physical harm to the platform. We tackle this novel, yet rich, set of problems under the assumption that the unknown function satisfies regularity conditions expressed via a Gaussian process prior. We develop an efficient algorithm called SAFEOPT and theoretically guarantee its convergence to a natural notion of optimum reachable under safety constraints. We evaluate SAFEOPT on synthetic data, as well as two real applications: movie recommendation and therapeutic spinal cord stimulation.
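The core safe-exploration idea can be illustrated with a short sketch. This is not the SAFEOPT algorithm as specified in the paper, only a minimal illustration under assumed ingredients: a scikit-learn Gaussian process, a toy 1-D objective f, a safety threshold h, a confidence parameter beta, and a known safe seed point. The sketch restricts sampling to points whose GP lower confidence bound exceeds h and, among those, picks the most uncertain one.

```python
# Minimal sketch of safe exploration with a GP (illustrative, not the paper's SAFEOPT).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):                      # unknown objective (hidden from the algorithm)
    return np.sin(3 * x) + 0.5 * x

h = 0.0                        # safety threshold: sampled values should satisfy f(x) >= h
beta = 2.0                     # confidence-interval width multiplier (assumed constant here)
X_grid = np.linspace(0.0, 2.0, 200).reshape(-1, 1)

# Start from a seed point assumed to be safe.
X_obs = np.array([[1.0]])
y_obs = f(X_obs).ravel() + 0.05 * np.random.randn(1)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=0.05 ** 2)

for t in range(20):
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(X_grid, return_std=True)
    lower, upper = mu - beta * sigma, mu + beta * sigma

    safe = lower >= h          # points certified safe by the current posterior
    if not safe.any():
        break                  # no certifiably safe point left to sample
    # Among safe points, sample the one with the widest confidence interval.
    idx = np.argmax(np.where(safe, upper - lower, -np.inf))
    x_next = X_grid[idx:idx + 1]
    y_next = f(x_next).ravel() + 0.05 * np.random.randn(1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.concatenate([y_obs, y_next])

# Report the best observed (safe) point.
best = int(np.argmax(y_obs))
print("best safe x:", X_obs[best, 0], "observed value:", y_obs[best])
```

The sketch captures only the common thread of the approach: the GP posterior supplies confidence bounds that simultaneously certify safety and guide exploration; the paper's algorithm and guarantees go beyond this simple acquisition rule.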
