Regret Bounds for Gaussian Process Bandit Problems

Bandit algorithms address the trade-off between exploration and exploitation that arises when a number of options are available but their quality can only be learned by experimenting with them. We consider the scenario in which the reward distribution over arms is modelled by a Gaussian process and the observed rewards are noise-free. Our main result bounds the regret experienced by such algorithms relative to the a posteriori optimal strategy of playing the best arm throughout, under benign assumptions on the covariance function defining the Gaussian process. We complement these upper bounds with corresponding lower bounds for particular covariance functions, demonstrating that in general our upper bounds are at most logarithmically loose.
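
As a concrete illustration of the noiseless setting described above (a minimal sketch, not the algorithm analysed in the paper), the snippet below draws a reward function from a squared-exponential Gaussian process over a discretised arm set and plays a GP-UCB-style rule using the exact noise-free posterior; regret is accumulated against the best arm in hindsight. The kernel lengthscale, the confidence width `beta`, and the arm grid are all illustrative choices, not quantities taken from the paper.

```python
import numpy as np

def se_kernel(a, b, lengthscale=0.2):
    """Squared-exponential covariance between 1-D arm locations."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
arms = np.linspace(0.0, 1.0, 200)            # discretised arm set in [0, 1]
K = se_kernel(arms, arms) + 1e-10 * np.eye(len(arms))  # jitter for stability

# Draw one "true" reward function from the GP prior (a hypothetical instance
# of the kind of function the regret bounds are stated for).
f = rng.multivariate_normal(np.zeros(len(arms)), K)
best = f.max()

picked = [int(rng.integers(len(arms)))]      # first pull is arbitrary
regret = best - f[picked[0]]
for t in range(2, 51):
    obs = sorted(set(picked))                # condition on distinct arms only
    X = arms[obs]
    Kxx = se_kernel(X, X) + 1e-10 * np.eye(len(obs))
    Ksx = se_kernel(arms, X)
    # Exact GP posterior given noise-free observations f(X).
    mu = Ksx @ np.linalg.solve(Kxx, f[obs])
    var = np.clip(1.0 - np.sum(Ksx * np.linalg.solve(Kxx, Ksx.T).T, axis=1),
                  0.0, None)
    beta = 2.0 * np.log(len(arms) * t ** 2)  # illustrative confidence width
    i = int(np.argmax(mu + np.sqrt(beta * var)))
    picked.append(i)
    regret += best - f[i]

print(f"cumulative regret after {len(picked)} noiseless pulls: {regret:.3f}")
```

Because observations are noise-free, the posterior variance at any previously pulled arm collapses to zero, so the confidence bound steers play toward unexplored regions until the posterior mean alone identifies a near-optimal arm; this is exactly the exploration/exploitation trade-off the abstract refers to.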
