Multi-armed Bandit with Additional Observations

We study multi-armed bandit (MAB) problems with additional observations, where in each round the decision maker selects an arm to play and may additionally observe the rewards of other arms, up to a given budget, by paying a cost per observation. We propose algorithms whose regrets are asymptotically optimal in the stochastic-reward setting and order-optimal in the adversarial-reward setting.
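To make the interaction protocol concrete, here is a minimal sketch of one round of the environment, under assumptions not fixed by the abstract: Bernoulli arms, a per-round observation budget B, and a flat cost c per extra observation. All names (play_round, mean_rewards, and so on) are illustrative, not taken from the paper.

```python
import random

K = 5                                    # number of arms (assumed)
mean_rewards = [0.2, 0.4, 0.5, 0.6, 0.8]  # unknown to the learner
B = 2                                    # max extra observations per round (assumed)
c = 0.1                                  # cost per extra observation (assumed)

def pull(arm):
    """Draw a Bernoulli reward for the given arm."""
    return 1.0 if random.random() < mean_rewards[arm] else 0.0

def play_round(played_arm, observed_arms):
    """One round of the protocol: play one arm, optionally pay to observe others.

    Returns the net payoff (reward of the played arm minus observation costs)
    and the feedback dictionary of all rewards revealed to the learner.
    """
    assert len(observed_arms) <= B, "observation budget exceeded"
    feedback = {played_arm: pull(played_arm)}
    for arm in observed_arms:
        feedback[arm] = pull(arm)        # observed only; not collected as reward
    net_payoff = feedback[played_arm] - c * len(observed_arms)
    return net_payoff, feedback

# Example: play arm 2 and additionally observe arms 0 and 4.
payoff, feedback = play_round(2, [0, 4])
print(payoff, feedback)
```

The point of the sketch is the information/payoff asymmetry: observed arms enlarge the feedback available for learning but only the played arm contributes reward, and each observation reduces the net payoff by its cost.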
