Best-Arm Identification in Correlated Multi-Armed Bandits

In this paper we consider the problem of best-arm identification in multi-armed bandits in the fixed-confidence setting, where the goal is to identify, with probability $1-\delta$ for some $\delta > 0$, the arm with the highest mean reward using the minimum possible number of samples from the set of arms $\mathcal{K}$. Most existing best-arm identification algorithms and analyses operate under the assumption that the rewards corresponding to different arms are independent of each other. We propose a novel correlated bandit framework that captures domain knowledge about the correlation between arms in the form of upper bounds on the expected conditional reward of an arm, given a reward realization from another arm. Our proposed algorithm C-LUCB, which generalizes the LUCB algorithm, utilizes this partial knowledge of correlations to sharply reduce the sample complexity of best-arm identification. More interestingly, we show that the total number of samples obtained by C-LUCB is of the form $O\left(\sum_{k \in \mathcal{C}} \log\left(\frac{1}{\delta}\right)\right)$, as opposed to the typical $O\left(\sum_{k \in \mathcal{K}} \log\left(\frac{1}{\delta}\right)\right)$ samples required in the independent-reward setting. The improvement arises because the $O(\log(1/\delta))$ term is summed only over the set of competitive arms $\mathcal{C}$, which is a subset of the original set of arms $\mathcal{K}$. Depending on the problem setting, the size of $\mathcal{C}$ can be as small as 2, so using C-LUCB in the correlated bandit setting can lead to significant performance improvements. Our theoretical findings are supported by experiments on the MovieLens and Goodreads recommendation datasets.
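To make the abstract's description concrete, below is a minimal Python sketch of the C-LUCB idea, written only from what the abstract states: the learner knows pseudo-rewards $s_{\ell,k}(r)$, upper bounds on $\mathbb{E}[R_\ell \mid R_k = r]$, and uses them to restrict LUCB-style sampling to the competitive set. The function and variable names (`c_lucb`, `sample`, `s`, `phi`), the competitive-set test, and the particular confidence-bonus constants are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def c_lucb(sample, s, K, delta, max_steps=200_000):
    """Hedged sketch of C-LUCB based on the abstract's description.

    sample(k)   -> one reward draw from arm k, assumed to lie in [0, 1]
    s(l, k, r)  -> pseudo-reward: known upper bound on E[R_l | R_k = r]
    Returns the index of the identified best arm.
    """
    n = np.zeros(K)         # pull counts
    mu = np.zeros(K)        # empirical mean rewards
    phi = np.zeros((K, K))  # running means of pseudo-rewards s(l, k, .)

    # Initialize by pulling every arm once.
    for k in range(K):
        r = sample(k)
        n[k], mu[k] = 1, r
        for l in range(K):
            phi[l, k] = s(l, k, r)

    for t in range(K, max_steps):
        best = int(np.argmax(mu))
        # Competitive set: keep arms whose pseudo-reward upper bound,
        # computed from samples of the empirically best arm, is not
        # below that arm's empirical mean. Others need no more pulls.
        comp = [l for l in range(K) if l == best or phi[l, best] >= mu[best]]

        # A standard LUCB-style confidence bonus (one common choice;
        # the paper's exact constants may differ).
        b = np.sqrt(np.log(5 * K * (t + 1) ** 4 / (4 * delta)) / (2 * n))
        ucb, lcb = mu + b, mu - b

        challengers = [l for l in comp if l != best]
        if not challengers:
            return best
        ch = max(challengers, key=lambda l: ucb[l])
        if lcb[best] >= ucb[ch]:
            return best  # best arm separated with confidence 1 - delta

        # LUCB rule: sample the empirical leader and its strongest
        # challenger, updating means and pseudo-reward estimates.
        for k in (best, ch):
            r = sample(k)
            n[k] += 1
            mu[k] += (r - mu[k]) / n[k]
            for l in range(K):
                phi[l, k] += (s(l, k, r) - phi[l, k]) / n[k]

    return int(np.argmax(mu))  # budget exhausted: return empirical best
```

The key design point this sketch tries to reflect is that non-competitive arms are filtered out by the pseudo-reward test alone, without per-arm confidence-interval separation, which is why only the arms in $\mathcal{C}$ contribute $O(\log(1/\delta))$ samples each.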
