Regional Multi-Armed Bandits

We consider a variant of the classic multi-armed bandit problem in which the expected reward of each arm is a function of an unknown parameter. The arms are divided into groups, each of which shares a common parameter. Therefore, when the player selects an arm at each time slot, information about the other arms in the same group is also revealed. This regional bandit model naturally bridges the non-informative bandit setting, where the player learns only about the chosen arm, and the global bandit model, where sampling one arm reveals information about all arms. We propose an efficient algorithm, UCB-g, that solves the regional bandit problem by combining the Upper Confidence Bound (UCB) and greedy principles. We derive both parameter-dependent and parameter-free regret upper bounds, and establish a matching lower bound that proves the order-optimality of UCB-g. Moreover, we propose SW-UCB-g, an extension of UCB-g for non-stationary environments where the group parameters vary slowly over time.
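To make the UCB-g principle concrete, the following is a minimal Python sketch of how a group-level UCB index can be combined with a greedy within-group arm choice. All names here (`ucb_g`, `mu`, `mu_inv`, `pull`) and the particular parameter estimator, a running average of inverted rewards, which assumes each reward function is invertible in the parameter, are illustrative assumptions rather than the paper's reference implementation.

```python
import numpy as np


def ucb_g(mu, mu_inv, pull, n_groups, arms_per_group, horizon):
    """Sketch of a UCB-g-style policy (hypothetical interface).

    mu(j, k, theta) : known expected reward of arm k in group j at parameter theta.
    mu_inv(j, k, r) : inverse of mu(j, k, .), mapping a reward back to a
                      parameter estimate (assumes invertibility).
    pull(j, k)      : draws one stochastic reward from arm (j, k).
    """
    theta_hat = np.zeros(n_groups)   # per-group parameter estimates
    counts = np.zeros(n_groups)      # samples collected per group
    history = []

    for t in range(1, horizon + 1):
        if t <= n_groups:
            j = t - 1                # initialization: sample each group once
        else:
            # UCB over groups: the exploration bonus shrinks with the
            # number of samples collected in the whole group.
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            best_val = np.array([
                max(mu(g, k, theta_hat[g]) for k in range(arms_per_group))
                for g in range(n_groups)
            ])
            j = int(np.argmax(best_val + bonus))

        # Greedy within the chosen group: best arm under the current estimate.
        k = int(np.argmax([mu(j, a, theta_hat[j]) for a in range(arms_per_group)]))
        r = pull(j, k)

        # Update the shared group parameter from the observed reward.
        counts[j] += 1
        theta_hat[j] += (mu_inv(j, k, r) - theta_hat[j]) / counts[j]
        history.append((j, k, r))
    return history


# Toy usage: 2 groups x 3 arms with linear reward functions
# mu_{j,k}(theta) = a_{j,k} * theta and Gaussian observation noise.
rng = np.random.default_rng(0)
a = np.array([[0.2, 0.5, 0.9], [0.3, 0.6, 0.8]])
theta_true = np.array([0.7, 0.4])

mu = lambda j, k, th: a[j, k] * th
mu_inv = lambda j, k, r: np.clip(r / a[j, k], 0.0, 1.0)
pull = lambda j, k: a[j, k] * theta_true[j] + 0.1 * rng.standard_normal()

hist = ucb_g(mu, mu_inv, pull, n_groups=2, arms_per_group=3, horizon=2000)
```

The design mirrors the two principles named in the abstract: optimism is applied at the group level, since every pull in a group improves the shared parameter estimate, while the arm choice inside the chosen group can be purely greedy because all arm means in the group are coupled through the single estimate of that group's parameter.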
