On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems

Multi-armed bandit problems are considered a paradigm of the trade-off between exploring the environment to find profitable actions and exploiting what is already known. In the stationary case, where the distributions of the rewards do not change over time, Upper-Confidence Bound (UCB) policies have been shown to be rate optimal. A challenging variant of the multi-armed bandit problem is the non-stationary bandit problem, in which the gambler must decide which arm to play while facing the possibility of a changing environment. In this paper, we consider the situation where the distributions of rewards remain constant over epochs and change at unknown time instants. We analyze two algorithms: the discounted UCB and the sliding-window UCB. For each of them, we establish an upper bound on the expected regret by upper-bounding the expectation of the number of times a suboptimal arm is played. For that purpose, we derive a Hoeffding-type inequality for self-normalized deviations with a random number of summands. We also establish a lower bound on the regret in the presence of abrupt changes in the arms' reward distributions, and we show that both the discounted UCB and the sliding-window UCB match this lower bound up to a logarithmic factor.
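
To make the two index policies concrete, below is a minimal Python sketch of the discounted UCB and sliding-window UCB indices as they are usually presented: the discounted variant geometrically down-weights past observations, while the sliding-window variant keeps only the last tau plays. The class names, default parameter values (gamma, tau, xi, the reward bound B) and the exact constants in the exploration padding are illustrative assumptions, not the paper's tuned choices.

```python
import math
from collections import deque


class DiscountedUCB:
    """Sketch of a discounted UCB (D-UCB) index policy.

    Arm statistics are multiplied by gamma at every step, so observations
    from before a change-point are progressively forgotten.
    """

    def __init__(self, n_arms, gamma=0.99, xi=0.6, reward_bound=1.0):
        self.n_arms = n_arms
        self.gamma = gamma               # discount factor in (0, 1), assumed value
        self.xi = xi                     # exploration coefficient, assumed value
        self.B = reward_bound            # upper bound on the rewards
        self.counts = [0.0] * n_arms     # discounted play counts
        self.sums = [0.0] * n_arms       # discounted reward sums

    def select_arm(self):
        # Play each arm once before relying on the index.
        for i in range(self.n_arms):
            if self.counts[i] == 0.0:
                return i
        n_total = sum(self.counts)

        def index(i):
            mean = self.sums[i] / self.counts[i]
            padding = 2.0 * self.B * math.sqrt(
                self.xi * math.log(n_total) / self.counts[i])
            return mean + padding

        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        # Discount all past statistics, then record the new observation.
        for i in range(self.n_arms):
            self.counts[i] *= self.gamma
            self.sums[i] *= self.gamma
        self.counts[arm] += 1.0
        self.sums[arm] += reward


class SlidingWindowUCB:
    """Sketch of a sliding-window UCB (SW-UCB) policy.

    Only the last tau (arm, reward) pairs are kept, so the index reflects
    the current epoch rather than the full history.
    """

    def __init__(self, n_arms, tau=1000, xi=0.6, reward_bound=1.0):
        self.n_arms = n_arms
        self.tau = tau                   # window length, assumed value
        self.xi = xi
        self.B = reward_bound
        self.history = deque()           # observations inside the window

    def select_arm(self, t):
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        # Play each arm once (within the window) before using the index.
        for i in range(self.n_arms):
            if counts[i] == 0:
                return i
        horizon = min(t, self.tau)

        def index(i):
            mean = sums[i] / counts[i]
            padding = self.B * math.sqrt(
                self.xi * math.log(horizon) / counts[i])
            return mean + padding

        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        self.history.append((arm, reward))
        if len(self.history) > self.tau:
            self.history.popleft()
```

A change-point can be simulated by swapping the arms' reward distributions at a fixed instant: with gamma < 1 or a finite tau, the indices recover after the switch because stale observations are forgotten, whereas a standard UCB index averaging the full history does not.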
