Contextual Bandits with Linear Payoff Functions

In this paper we study the contextual bandit problem (also known as the multi-armed bandit problem with expert advice) for linear payoff functions. For T rounds, K actions, and d dimensional feature vectors, we prove an O (√ Td ln(KT ln(T )/δ) ) regret bound that holds with probability 1− δ for the simplest known (both conceptually and computationally) efficient upper confidence bound algorithm for this problem. We also prove a lower bound of Ω( √ Td) for this setting, matching the upper bound up to logarithmic factors.

[1]  Peter Auer,et al.  The Nonstochastic Multiarmed Bandit Problem , 2002, SIAM J. Comput..

[2]  H. Robbins,et al.  Asymptotically efficient adaptive allocation rules , 1985 .

[3]  Leslie Pack Kaelbling,et al.  Associative Reinforcement Learning: Functions in k-DNF , 1994, Machine Learning.

[4]  Peter Auer,et al.  Using Confidence Bounds for Exploitation-Exploration Trade-offs , 2003, J. Mach. Learn. Res..

[5]  Deepak Agarwal,et al.  Online Models for Content Optimization , 2008, NIPS.

[6]  Chris Mesterharm,et al.  Experience-efficient learning in associative bandit problems , 2006, ICML.

[7]  Dimitris K. Tasoulis,et al.  Simulation Studies of Multi-armed Bandits with Covariates (Invited Paper) , 2008, Tenth International Conference on Computer Modeling and Simulation (uksim 2008).

[8]  J. Langford,et al.  The Epoch-Greedy algorithm for contextual multi-armed bandits , 2007, NIPS 2007.

[9]  J. Sarkar One-Armed Bandit Problems with Covariates , 1991 .

[10]  Philip M. Long,et al.  Reinforcement Learning with Immediate Rewards and Linear Hypotheses , 2003, Algorithmica.

[11]  M. Woodroofe A One-Armed Bandit Problem with a Concomitant Variable , 1979 .

[12]  Thomas J. Walsh,et al.  Exploring compact reinforcement-learning representations with linear regression , 2009, UAI.

[13]  Thomas P. Hayes,et al.  Stochastic Linear Optimization under Bandit Feedback , 2008, COLT.

[14]  R. Agrawal Sample mean based index policies by O(log n) regret for the multi-armed bandit problem , 1995, Advances in Applied Probability.

[15]  John N. Tsitsiklis,et al.  Linearly Parameterized Bandits , 2008, Math. Oper. Res..

[16]  Wei Chu,et al.  A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.

[17]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.