Best-Arm Identification in Correlated Multi-Armed Bandits

In this paper we consider the problem of best-arm identification in multi-armed bandits in the fixed-confidence setting, where the goal is to identify, with probability $1-\delta$ for some $\delta > 0$, the arm with the highest mean reward using the minimum possible number of samples from the set of arms $\mathcal{K}$. Most existing best-arm identification algorithms and analyses operate under the assumption that the rewards corresponding to different arms are independent of each other. We propose a novel correlated bandit framework that captures domain knowledge about the correlation between arms in the form of upper bounds on the expected conditional reward of an arm, given a reward realization from another arm. Our proposed algorithm C-LUCB, which generalizes the LUCB algorithm, utilizes this partial knowledge of correlations to sharply reduce the sample complexity of best-arm identification. More interestingly, we show that the total number of samples obtained by C-LUCB is of the form $O\left(\sum_{k \in \mathcal{C}} \log\left(\frac{1}{\delta}\right)\right)$, as opposed to the typical $O\left(\sum_{k \in \mathcal{K}} \log\left(\frac{1}{\delta}\right)\right)$ samples required in the independent-reward setting. The improvement arises because the $O(\log(1/\delta))$ term is summed only over the set of competitive arms $\mathcal{C}$, which is a subset of the original set of arms $\mathcal{K}$. Depending on the problem setting, the size of $\mathcal{C}$ can be as small as 2, so using C-LUCB in the correlated bandit setting can lead to significant performance improvements. Our theoretical findings are supported by experiments on the MovieLens and Goodreads recommendation datasets.
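To make the abstract's description concrete, below is a minimal Python sketch of the C-LUCB idea, written only from what the abstract states: the learner knows pseudo-rewards $s_{\ell,k}(r)$, upper bounds on $\mathbb{E}[R_\ell \mid R_k = r]$, and uses them to restrict LUCB-style sampling to the competitive set. The function and variable names (`c_lucb`, `sample`, `s`, `phi`), the competitive-set test, and the particular confidence-bonus constants are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def c_lucb(sample, s, K, delta, max_steps=200_000):
    """Hedged sketch of C-LUCB based on the abstract's description.

    sample(k)   -> one reward draw from arm k, assumed to lie in [0, 1]
    s(l, k, r)  -> pseudo-reward: known upper bound on E[R_l | R_k = r]
    Returns the index of the identified best arm.
    """
    n = np.zeros(K)         # pull counts
    mu = np.zeros(K)        # empirical mean rewards
    phi = np.zeros((K, K))  # running means of pseudo-rewards s(l, k, .)

    # Initialize by pulling every arm once.
    for k in range(K):
        r = sample(k)
        n[k], mu[k] = 1, r
        for l in range(K):
            phi[l, k] = s(l, k, r)

    for t in range(K, max_steps):
        best = int(np.argmax(mu))
        # Competitive set: keep arms whose pseudo-reward upper bound,
        # computed from samples of the empirically best arm, is not
        # below that arm's empirical mean. Others need no more pulls.
        comp = [l for l in range(K) if l == best or phi[l, best] >= mu[best]]

        # A standard LUCB-style confidence bonus (one common choice;
        # the paper's exact constants may differ).
        b = np.sqrt(np.log(5 * K * (t + 1) ** 4 / (4 * delta)) / (2 * n))
        ucb, lcb = mu + b, mu - b

        challengers = [l for l in comp if l != best]
        if not challengers:
            return best
        ch = max(challengers, key=lambda l: ucb[l])
        if lcb[best] >= ucb[ch]:
            return best  # best arm separated with confidence 1 - delta

        # LUCB rule: sample the empirical leader and its strongest
        # challenger, updating means and pseudo-reward estimates.
        for k in (best, ch):
            r = sample(k)
            n[k] += 1
            mu[k] += (r - mu[k]) / n[k]
            for l in range(K):
                phi[l, k] += (s(l, k, r) - phi[l, k]) / n[k]

    return int(np.argmax(mu))  # budget exhausted: return empirical best
```

The key design point this sketch tries to reflect is that non-competitive arms are filtered out by the pseudo-reward test alone, without per-arm confidence-interval separation, which is why only the arms in $\mathcal{C}$ contribute $O(\log(1/\delta))$ samples each.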
