Tighter Bounds for Multi-Armed Bandits with Expert Advice

Bandit problems are a classic way of formulating the tradeoff between exploration and exploitation. Auer et al. [ACBFS02] introduced the EXP4 algorithm, which explicitly decouples the set of A actions that can be taken in the world from the set of M experts (general strategies for selecting actions) with which we wish to be competitive. Auer et al. show that EXP4 has expected cumulative regret bounded by O(√(TA log M)), where T is the total number of rounds. This bound is attractive when the number of actions is small compared to the number of experts, but poor when the situation is reversed. In this paper we introduce a new algorithm, similar in spirit to EXP4, with a regret bound of O(√(TS log M)). The parameter S measures the extent to which the experts' recommendations agree; we always have S ≤ min{A, M}. We discuss practical applications that arise in the contextual bandits setting, including sponsored search keyword advertising. In these problems, common context means many actions are irrelevant on any given round, so S ≪ min{A, M}, and our bounds offer a significant improvement. The key to our new algorithm is a linear-programming-based exploration strategy that is optimal in a certain sense. In addition to proving tighter bounds, we run experiments on real-world data from an online advertising problem and demonstrate that our refined exploration strategy leads to significant improvements over known approaches.
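
For concreteness, the following is a minimal Python sketch of the EXP4 baseline referenced above (Auer et al. [ACBFS02]): exponential weights over experts, a uniform-exploration mixture, and importance-weighted reward estimates. The advice_fn/reward_fn interface and parameter names are ours for illustration, not part of the paper; the LP-based exploration strategy of the new algorithm is not reproduced here.

    import numpy as np

    def exp4(advice_fn, reward_fn, T, A, M, gamma=0.1, rng=None):
        """Sketch of EXP4.

        advice_fn(t) -> (M, A) array; row i is expert i's distribution over actions.
        reward_fn(t, action) -> reward in [0, 1] for the chosen action.
        """
        rng = rng or np.random.default_rng()
        w = np.ones(M)                          # one weight per expert
        total_reward = 0.0
        for t in range(T):
            xi = advice_fn(t)                   # expert advice, rows sum to 1
            # Mix the weighted average of expert advice with uniform exploration.
            p = (1 - gamma) * (w @ xi) / w.sum() + gamma / A
            a = rng.choice(A, p=p)
            x = reward_fn(t, a)
            total_reward += x
            # Importance-weighted estimate of the reward vector over actions.
            x_hat = np.zeros(A)
            x_hat[a] = x / p[a]
            # Credit each expert with its expected estimated reward, then update.
            y_hat = xi @ x_hat
            w *= np.exp(gamma * y_hat / A)
        return total_reward

The O(√(TA log M)) regret bound above is for this style of uniform exploration; the paper's contribution is to replace it with an exploration distribution chosen by a linear program, which is what yields the S-dependent bound.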