Paper Information - Contextual Bandits with Linear Payoff Functions

Contextual Bandits with Linear Payoff Functions

In this paper we study the contextual bandit problem (also known as the multi-armed bandit problem with expert advice) for linear payoff functions. For T rounds, K actions, and d-dimensional feature vectors, we prove an O(√(Td ln³(KT ln(T)/δ))) regret bound that holds with probability 1 − δ for the simplest known (both conceptually and computationally) efficient upper confidence bound algorithm for this problem. We also prove a lower bound of Ω(√(Td)) for this setting, matching the upper bound up to logarithmic factors.
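To make the setting concrete, the following is a minimal sketch of a LinUCB-style upper confidence bound rule for linear payoffs with a single shared parameter vector. It is an illustration under stated assumptions, not the exact variant analyzed in the paper: the function names, the exploration width alpha, the unit-norm contexts, the Gaussian noise, and the small simulation are all hypothetical choices made for this example.

```python
import numpy as np

def linucb_choose(A, b, contexts, alpha):
    """Return the index of the arm with the largest upper confidence bound.

    A: (d, d) regularized design matrix; b: (d,) reward-weighted feature sum;
    contexts: (K, d) array with one feature vector per arm;
    alpha: exploration width (an assumed tuning parameter).
    """
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                      # ridge-regression estimate
    means = contexts @ theta_hat               # predicted payoff per arm
    # Confidence width per arm: alpha * sqrt(x^T A^{-1} x)
    widths = alpha * np.sqrt(np.einsum("kd,de,ke->k", contexts, A_inv, contexts))
    return int(np.argmax(means + widths))

def linucb_update(A, b, x, reward):
    """Fold the chosen arm's features and observed reward into the statistics."""
    return A + np.outer(x, x), b + reward * x

def run_simulation(T=2000, K=5, d=4, alpha=1.0, noise=0.1, seed=0):
    """Simulate T rounds on synthetic data and return the cumulative regret."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=d)
    theta /= np.linalg.norm(theta)             # hidden parameter (assumed unit norm)
    A, b = np.eye(d), np.zeros(d)
    regret = 0.0
    for _ in range(T):
        contexts = rng.normal(size=(K, d))
        contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)
        expected = contexts @ theta            # true expected payoffs
        arm = linucb_choose(A, b, contexts, alpha)
        reward = expected[arm] + noise * rng.normal()
        A, b = linucb_update(A, b, contexts[arm], reward)
        regret += expected.max() - expected[arm]
    return regret
```

Calling run_simulation() returns the cumulative regret against the best arm in each round. On synthetic data of this kind the total would be expected to grow sublinearly in T, in the spirit of the √(Td) rates stated in the abstract, although this sketch makes no attempt to reproduce the constants or logarithmic factors of the bound.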
