Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits

We present a new algorithm for the contextual bandit learning problem, in which the learner repeatedly takes one of K actions in response to an observed context and observes the reward only for the chosen action. Our method assumes access to an oracle for solving fully supervised cost-sensitive classification problems, and it achieves the statistically optimal regret guarantee with only Õ(√KT) oracle calls across all T rounds (the Õ hides logarithmic factors). As a result, we obtain the most practical contextual bandit learning algorithm among approaches that work for general policy classes. A proof-of-concept experiment demonstrates the excellent computational and statistical performance of (an online variant of) our algorithm relative to several strong baselines.
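To make the setting concrete, the Python sketch below illustrates the contextual bandit protocol and the kind of cost-sensitive classification oracle the abstract refers to. It is only an illustration, not the paper's algorithm: for simplicity it plugs the oracle into a basic ε-greedy loop with inverse-propensity-scored (IPS) costs, which calls the oracle every round and is statistically suboptimal, whereas the paper's method achieves optimal regret with only Õ(√KT) oracle calls in total. All names here (`CSCOracle`, `epsilon_greedy_loop`, the `env` interface) are hypothetical.

```python
import numpy as np

class CSCOracle:
    """Cost-sensitive classification oracle over a finite policy class.

    Given examples (x_t, c_t), where c_t is a length-K cost vector, it
    returns the policy minimizing cumulative cost. The policy class is
    brute-forced here purely for illustration; in practice the oracle
    would be a cost-sensitive supervised learning algorithm.
    """

    def __init__(self, policies):
        self.policies = policies  # each policy maps a context to an action in {0..K-1}

    def best_policy(self, contexts, costs):
        totals = [sum(c[pi(x)] for x, c in zip(contexts, costs))
                  for pi in self.policies]
        return self.policies[int(np.argmin(totals))]


def epsilon_greedy_loop(env, oracle, K, T, epsilon=0.1, rng=None):
    """Run T rounds of epsilon-greedy exploration with IPS cost estimates.

    Hypothetical environment interface: env.context(t) yields a context,
    env.reward(x, a) yields the observed reward in [0, 1] for the played
    action only. IPS costs make the oracle's input an unbiased estimate
    of each policy's expected cost despite the partial feedback.
    """
    rng = rng or np.random.default_rng(0)
    contexts, ips_costs, total_reward = [], [], 0.0
    pi_hat = oracle.policies[0]  # arbitrary initial policy
    for t in range(T):
        x = env.context(t)
        # Mix the current greedy policy with uniform exploration.
        probs = np.full(K, epsilon / K)
        probs[pi_hat(x)] += 1.0 - epsilon
        a = rng.choice(K, p=probs)
        r = env.reward(x, a)  # reward observed only for the played action
        total_reward += r
        # IPS cost vector: only the played action gets a (scaled) cost.
        c = np.zeros(K)
        c[a] = (1.0 - r) / probs[a]
        contexts.append(x)
        ips_costs.append(c)
        # One oracle call per round; the paper's algorithm needs far fewer.
        pi_hat = oracle.best_policy(contexts, ips_costs)
    return total_reward
```

The IPS trick is what lets a supervised-learning oracle drive bandit learning at all: in expectation over the randomized action, each cost vector matches the full-feedback cost of every policy. The paper's contribution is achieving the optimal √ T-type regret while invoking such an oracle only Õ(√KT) times overall, rather than once per round as in this sketch.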
