Tighter Bounds for Multi-Armed Bandits with Expert Advice

Bandit problems are a classic way of formulating the tradeoff between exploration and exploitation. Auer et al. [ACBFS02] introduced the EXP4 algorithm, which explicitly decouples the set of A actions that can be taken in the world from the set of M experts (general strategies for selecting actions) with which we wish to be competitive. Auer et al. show that EXP4 has expected cumulative regret bounded by O(√(TA log M)), where T is the total number of rounds. This bound is attractive when the number of actions is small compared to the number of experts, but poor when the situation is reversed. In this paper we introduce a new algorithm, similar in spirit to EXP4, with a regret bound of O(√(TS log M)). The parameter S measures the extent to which the experts' recommendations agree; we always have S ≤ min{A, M}. We discuss practical applications that arise in the contextual bandits setting, including sponsored search keyword advertising. In these problems, common context means many actions are irrelevant on any given round, so S ≪ min{A, M}, and our bounds offer a significant improvement. The key to our new algorithm is a linear-programming-based exploration strategy that is optimal in a certain sense. In addition to proving tighter bounds, we run experiments on real-world data from an online advertising problem and demonstrate that our refined exploration strategy leads to significant improvements over known approaches.
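
For concreteness, the following is a minimal Python sketch of the EXP4 baseline referenced above (Auer et al. [ACBFS02]): exponential weights over experts, a uniform-exploration mixture, and importance-weighted reward estimates. The advice_fn/reward_fn interface and parameter names are ours for illustration, not part of the paper; the LP-based exploration strategy of the new algorithm is not reproduced here.

    import numpy as np

    def exp4(advice_fn, reward_fn, T, A, M, gamma=0.1, rng=None):
        """Sketch of EXP4.

        advice_fn(t) -> (M, A) array; row i is expert i's distribution over actions.
        reward_fn(t, action) -> reward in [0, 1] for the chosen action.
        """
        rng = rng or np.random.default_rng()
        w = np.ones(M)                          # one weight per expert
        total_reward = 0.0
        for t in range(T):
            xi = advice_fn(t)                   # expert advice, rows sum to 1
            # Mix the weighted average of expert advice with uniform exploration.
            p = (1 - gamma) * (w @ xi) / w.sum() + gamma / A
            a = rng.choice(A, p=p)
            x = reward_fn(t, a)
            total_reward += x
            # Importance-weighted estimate of the reward vector over actions.
            x_hat = np.zeros(A)
            x_hat[a] = x / p[a]
            # Credit each expert with its expected estimated reward, then update.
            y_hat = xi @ x_hat
            w *= np.exp(gamma * y_hat / A)
        return total_reward

The O(√(TA log M)) regret bound above is for this style of uniform exploration; the paper's contribution is to replace it with an exploration distribution chosen by a linear program, which is what yields the S-dependent bound.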