Correlated Multi-Armed Bandits with a Latent Random Source

Multi-armed bandit models are widely studied sequential decision-making problems that exemplify the exploration-exploitation trade-off. We study a novel correlated multi-armed bandit model in which the rewards obtained from the arms are functions of a common latent random variable. We propose and analyze the C-UCB algorithm, which leverages the correlations between arms to reduce the cumulative regret (i.e., to increase the total reward obtained after T rounds). Unlike the standard UCB algorithm, which pulls every sub-optimal arm O(log T) times, the C-UCB algorithm pulls certain sub-optimal arms, which we refer to as non-competitive arms, only O(1) times. Thus, we effectively reduce a K-armed bandit problem to a (C+1)-armed bandit problem, where C < K is the number of competitive arms, a quantity that can be computed from the reward functions. A key consequence is that when C = 0, our algorithm achieves constant (i.e., O(1)) regret instead of the standard O(log T) scaling with the number of rounds T. Establishing lower bounds on the regret, we show that the C-UCB algorithm is order-wise optimal, and we demonstrate its superiority over other algorithms via numerical simulations.
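To make the algorithmic idea concrete, below is a minimal Python sketch of a C-UCB-style rule. It assumes rewards and the latent variable both take values in [0, 1] and that the known reward functions are available as callables; the helper pseudo_reward, the grid-based inversion, the exploration constant alpha, and the exact competitive-set test are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np


def c_ucb(reward_fns, sample_latent, T, alpha=2.0):
    """Illustrative C-UCB-style sketch (names and details are assumptions).

    reward_fns[k](x) -> reward of arm k when the latent source takes value x.
    sample_latent()  -> one draw of the latent random variable (environment).
    """
    K = len(reward_fns)
    counts = np.zeros(K)       # number of pulls of each arm
    means = np.zeros(K)        # empirical mean reward of each arm
    pseudo_sums = np.zeros(K)  # accumulated pseudo-rewards of each arm
    total_reward = 0.0

    def pseudo_reward(k, l, r):
        # Largest reward arm k could have given, over latent values x that are
        # consistent with arm l returning reward r. A coarse grid search over
        # an assumed latent support [0, 1]; the paper assumes this quantity is
        # computable from the known reward functions.
        grid = np.linspace(0.0, 1.0, 201)
        consistent = [x for x in grid if abs(reward_fns[l](x) - r) < 1e-3]
        return max((reward_fns[k](x) for x in consistent), default=1.0)

    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                    # pull each arm once to initialize
        else:
            k_emp = int(np.argmax(means))  # empirically best arm so far
            # Competitive arms: average pseudo-reward is not below the
            # empirical mean of the empirically best arm; other arms are
            # treated as non-competitive and skipped this round.
            competitive = [k for k in range(K)
                           if k == k_emp
                           or pseudo_sums[k] / (t - 1) >= means[k_emp]]
            ucb = means + np.sqrt(alpha * np.log(t) / counts)
            arm = max(competitive, key=lambda k: ucb[k])
        x = sample_latent()
        r = reward_fns[arm](x)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        for k in range(K):
            pseudo_sums[k] += pseudo_reward(k, arm, r)
        total_reward += r
    return total_reward
```

As a usage example under the same assumptions, two arms with Y1 = X and Y2 = 1 - X for X uniform on [0, 1] could be simulated with c_ucb([lambda x: x, lambda x: 1 - x], np.random.rand, T=1000); arms whose pseudo-rewards fall below the best arm's empirical mean stop being explored, which is how the K-armed problem effectively shrinks to a (C+1)-armed one.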
