Weighted Gaussian Process Bandits for Non-stationary Environments

In this paper, we consider the Gaussian process (GP) bandit optimization problem in a non-stationary environment. To capture external changes, the black-box reward function is allowed to be time-varying within a reproducing kernel Hilbert space (RKHS). We develop WGP-UCB, a novel UCB-type algorithm based on weighted Gaussian process regression. A key challenge is coping with infinite-dimensional feature maps; we address it by leveraging kernel approximation techniques to prove a sublinear regret bound, which is the first (frequentist) sublinear regret guarantee for weighted time-varying bandits with general nonlinear rewards. This result generalizes both non-stationary linear bandits and the standard GP-UCB algorithm. Furthermore, we establish a novel concentration inequality for weighted Gaussian process regression with general weights, and we provide both universal and weight-dependent upper bounds on the weighted maximum information gain. These results are potentially of independent interest for applications such as news ranking and adaptive pricing, where weights can capture the importance or quality of data. Finally, we conduct experiments showing that the proposed algorithm compares favorably with existing methods in many cases.
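To make the weighted-regression idea concrete, below is a minimal sketch of a weighted GP-UCB loop. It is an illustration only, not the paper's exact construction: the RBF kernel, the discount-style weights w_i = gamma**(t - i), the exploration parameter beta, the noise scale lam, and the drifting toy objective f_t are all assumptions chosen for the example. Weights enter as per-observation noise scales, which is one common way to realize weighted GP / kernel ridge regression.

```python
# Hedged sketch of a weighted-GP-regression UCB loop (WGP-UCB-style).
# Kernel choice, weights, beta, lam, and the toy objective are illustrative assumptions.
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2):
    """Squared-exponential kernel matrix between row-stacked inputs A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def weighted_gp_posterior(X, y, w, Xs, lam=0.1):
    """Weighted GP posterior mean/std at test points Xs.

    Weights act as per-observation noise scales (noise_i = lam / w_i),
    i.e. the posterior uses (K + lam * W^{-1})^{-1}.
    """
    K = rbf_kernel(X, X)
    Ks = rbf_kernel(Xs, X)
    Kss = rbf_kernel(Xs, Xs)
    A = K + lam * np.diag(1.0 / w)
    alpha = np.linalg.solve(A, y)
    mu = Ks @ alpha
    var = np.diag(Kss) - np.einsum("ij,ji->i", Ks, np.linalg.solve(A, Ks.T))
    return mu, np.sqrt(np.clip(var, 0.0, None))

# Toy non-stationary objective: a bump whose peak drifts over the horizon.
T, gamma, beta = 60, 0.95, 2.0                 # horizon, discount, exploration weight (assumed)
def f_t(x, t):
    return np.exp(-30.0 * (x - (0.3 + 0.4 * t / T)) ** 2)

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)[:, None]     # candidate arms
X_hist, y_hist = [], []

for t in range(T):
    if not X_hist:
        x = grid[rng.integers(len(grid))]      # first pull: pick an arm at random
    else:
        X = np.vstack(X_hist)
        y = np.array(y_hist)
        w = gamma ** np.arange(len(y) - 1, -1, -1)   # newer observations get weight closer to 1
        mu, sd = weighted_gp_posterior(X, y, w, grid)
        x = grid[np.argmax(mu + beta * sd)]          # UCB acquisition rule
    reward = f_t(x[0], t) + 0.05 * rng.standard_normal()
    X_hist.append(x)
    y_hist.append(reward)

print("final chosen arm:", X_hist[-1][0])
```

The discount weights make older observations count less in the posterior, so the upper confidence bounds track the drifting optimum rather than averaging over the entire history.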
