Multi-armed Bandit with Additional Observations

We study multi-armed bandit (MAB) problems with additional observations: in each round, the decision maker selects an arm to play and may also observe the rewards of additional arms, within a given budget, by paying a certain cost per observation. For stochastic rewards, we develop KL-UCB-AO, a new algorithm that is asymptotically optimal as the time horizon grows large; it smartly identifies the optimal set of arms to explore with the given budget of additional observations. For adversarial rewards, we propose H-INF, an algorithm with order-optimal regret. H-INF exploits a two-layered structure in which each layer runs a known optimal MAB algorithm. This hierarchical structure facilitates the regret analysis and, in turn, yields order-optimal regret. We apply the framework of MAB with additional observations to the design of rate-adaptation schemes in 802.11-like wireless systems and of online advertisement systems. In both cases, we demonstrate that our algorithms leverage additional observations to significantly improve system performance. We believe the techniques developed in this paper are of independent interest for other MAB problems, e.g., contextual or graph-structured MAB.
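As a concrete reference point, the sketch below shows the standard KL-UCB index on which KL-UCB-AO builds, computed by bisection for Bernoulli rewards. This is a minimal illustration, not the paper's AO variant: the function names (`kl_bernoulli`, `kl_ucb_index`) are ours, the exploration level is taken as log(t) (some analyses add a log log(t) refinement), and the budgeted selection of additional arms to observe is omitted.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, tol=1e-6):
    """KL-UCB index of an arm: the largest q >= mean with
    pulls * KL(mean, q) <= log(t), found by bisection."""
    if pulls == 0:
        return 1.0  # unplayed arms get the most optimistic index
    level = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= level:
            lo = mid  # mid is still plausible; search higher
        else:
            hi = mid  # mid is too optimistic; search lower
    return lo

# Usage: at round t, play the arm with the largest index.
means = [0.3, 0.5, 0.45]   # empirical means per arm
pulls = [10, 10, 10]       # pull counts per arm
t = 30
best = max(range(len(means)), key=lambda i: kl_ucb_index(means[i], pulls[i], t))
```

Under the paper's setting, one would additionally spend the observation budget on a chosen subset of suboptimal arms each round; the index computation itself is unchanged.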
