Correlated Multi-Armed Bandits with a Latent Random Source

Multi-armed bandit models are widely studied sequential decision-making problems that exemplify the exploration-exploitation trade-off. We study a novel correlated multi-armed bandit model in which the rewards obtained from the arms are functions of a common latent random variable. We propose and analyze the C-UCB algorithm, which leverages the correlations between arms to reduce the cumulative regret (i.e., to increase the total reward obtained after T rounds). Unlike the standard UCB algorithm, which pulls every sub-optimal arm O(log T) times, the C-UCB algorithm pulls certain sub-optimal arms, which we refer to as non-competitive arms, only O(1) times. Thus, we effectively reduce a K-armed bandit problem to a (C+1)-armed bandit problem, where C < K is the number of competitive arms, a quantity that can be computed from the reward functions. A key consequence is that when C = 0, our algorithm achieves constant (i.e., O(1)) regret instead of the standard O(log T) scaling with the number of rounds T. Establishing lower bounds on the regret, we show that the C-UCB algorithm is order-wise optimal, and we demonstrate its superiority over other algorithms via numerical simulations.
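To make the algorithmic idea concrete, below is a minimal Python sketch of a C-UCB-style rule. It assumes rewards and the latent variable both take values in [0, 1] and that the known reward functions are available as callables; the helper pseudo_reward, the grid-based inversion, the exploration constant alpha, and the exact competitive-set test are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np


def c_ucb(reward_fns, sample_latent, T, alpha=2.0):
    """Illustrative C-UCB-style sketch (names and details are assumptions).

    reward_fns[k](x) -> reward of arm k when the latent source takes value x.
    sample_latent()  -> one draw of the latent random variable (environment).
    """
    K = len(reward_fns)
    counts = np.zeros(K)       # number of pulls of each arm
    means = np.zeros(K)        # empirical mean reward of each arm
    pseudo_sums = np.zeros(K)  # accumulated pseudo-rewards of each arm
    total_reward = 0.0

    def pseudo_reward(k, l, r):
        # Largest reward arm k could have given, over latent values x that are
        # consistent with arm l returning reward r. A coarse grid search over
        # an assumed latent support [0, 1]; the paper assumes this quantity is
        # computable from the known reward functions.
        grid = np.linspace(0.0, 1.0, 201)
        consistent = [x for x in grid if abs(reward_fns[l](x) - r) < 1e-3]
        return max((reward_fns[k](x) for x in consistent), default=1.0)

    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                    # pull each arm once to initialize
        else:
            k_emp = int(np.argmax(means))  # empirically best arm so far
            # Competitive arms: average pseudo-reward is not below the
            # empirical mean of the empirically best arm; other arms are
            # treated as non-competitive and skipped this round.
            competitive = [k for k in range(K)
                           if k == k_emp
                           or pseudo_sums[k] / (t - 1) >= means[k_emp]]
            ucb = means + np.sqrt(alpha * np.log(t) / counts)
            arm = max(competitive, key=lambda k: ucb[k])
        x = sample_latent()
        r = reward_fns[arm](x)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        for k in range(K):
            pseudo_sums[k] += pseudo_reward(k, arm, r)
        total_reward += r
    return total_reward
```

As a usage example under the same assumptions, two arms with Y1 = X and Y2 = 1 - X for X uniform on [0, 1] could be simulated with c_ucb([lambda x: x, lambda x: 1 - x], np.random.rand, T=1000); arms whose pseudo-rewards fall below the best arm's empirical mean stop being explored, which is how the K-armed problem effectively shrinks to a (C+1)-armed one.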
