Driving Exploration by Maximum Distribution in Gaussian Process Bandits

The problem of finding optimal solutions of stochastic functions over continuous domains arises in several real-world applications, e.g., advertisement allocation, dynamic pricing, and power control in wireless networks. The optimization is customarily performed by sequentially selecting input points and receiving a noisy observation of the function at each of them. In this paper, we resort to the Multi-Armed Bandit approach, aiming at optimizing stochastic functions while keeping the regret (i.e., the loss incurred during the learning process) under control. In particular, we focus on smooth stochastic functions, since it is known that, when the domain is continuous and the function does not satisfy any regularity condition, every algorithm suffers a constant per-round regret. Our main original contribution is a general family of algorithms which, under the mild assumption that the stochastic function is a realization of a Gaussian Process (GP), achieves a regret of order O(√(γ_T T)), where γ_T is the maximum information gain and T is the time horizon of the learning process. Furthermore, we design a specific algorithm of this family, called DAGP-UCB, which exploits the structure of GPs to select the next arm to pull more effectively than previous state-of-the-art algorithms, thus speeding up the learning process. Finally, we show the superior performance of DAGP-UCB in both synthetic and applicative settings, comparing it with state-of-the-art algorithms.
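To make the bandit loop concrete, below is a minimal sketch of a generic GP-UCB-style selection loop on a discretized continuous domain. It is not the paper's DAGP-UCB: the objective `noisy_f`, the RBF kernel, and the confidence schedule `beta_t` are illustrative assumptions, and scikit-learn is used only for convenience.

```python
# Minimal GP-UCB-style bandit loop (generic sketch; NOT the paper's DAGP-UCB).
# `noisy_f`, the kernel, and the beta_t schedule are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def noisy_f(x):
    # Hypothetical stochastic objective: smooth mean plus Gaussian noise.
    return np.sin(3 * x) + 0.1 * rng.standard_normal()

domain = np.linspace(0.0, 2.0, 200).reshape(-1, 1)  # discretized continuous domain
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=0.1**2)

X, y = [], []
T = 50
for t in range(1, T + 1):
    if X:
        gp.fit(np.array(X), np.array(y))
        mu, sigma = gp.predict(domain, return_std=True)
    else:
        # No data yet: fall back to the GP prior (zero mean, unit std).
        mu, sigma = np.zeros(len(domain)), np.ones(len(domain))
    beta_t = 2.0 * np.log(len(domain) * t**2)  # illustrative confidence schedule
    ucb = mu + np.sqrt(beta_t) * sigma         # optimistic upper confidence bound
    x_t = domain[np.argmax(ucb)]               # pull the most optimistic arm
    X.append(x_t)
    y.append(noisy_f(x_t[0]))                  # noisy observation of the function

print("best observed point:", X[int(np.argmax(y))][0])
```

The √β_t factor trades off exploitation (high posterior mean μ) against exploration (high posterior uncertainty σ); algorithms of the family discussed above differ precisely in how this selection rule is instantiated.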
