We study the contextual bandit problem with linear payoff functions, a generalization of the traditional multi-armed bandit problem. In the contextual bandit problem, the learner iteratively selects an action based on an observed context and receives a linear score on only the selected action as reward feedback. Motivated by the observation that better performance is achievable if the rewards of the non-selected actions can also be revealed to the learner, we propose a new framework that supplies the learner with pseudo-rewards, which are estimates of the rewards of the non-selected actions. We argue that the pseudo-rewards should be over-estimates of the true rewards, and we propose a forgetting mechanism that reduces the negative influence of the over-estimation in the long run. We then couple these two key ideas with the linear upper confidence bound (LinUCB) algorithm to design a novel algorithm called linear pseudo-reward upper confidence bound (LinPRUCB). We prove that LinPRUCB shares the same order of regret bound as LinUCB, while in practice it gathers rewards faster in the earlier iterations. Experiments on artificial and real-world data sets confirm that LinPRUCB is competitive with, and sometimes better than, LinUCB. Furthermore, we couple LinPRUCB with a special parameter setting to form a new algorithm that updates its internal models faster while keeping the promising practical performance. These two properties match the real-world needs of the contextual bandit problem and make the new algorithm a favorable choice in practice.
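To make the pseudo-reward and forgetting ideas concrete, the following is a minimal, hypothetical sketch of how such a learner could be organized, assuming LinUCB-style per-action ridge-regression statistics, pseudo-rewards taken as each non-selected action's own upper confidence bound (an over-estimate of its unobserved reward), and forgetting implemented as a discount factor on the pseudo-reward statistics. The class name, parameter names, and update details are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np


class LinPRUCBSketch:
    """Hypothetical sketch of a LinPRUCB-style contextual bandit learner.

    The exact formulas of the paper are not reproduced; this only illustrates
    the two abstract ideas: optimistic pseudo-rewards for non-selected actions
    and a forgetting factor that fades their influence over time.
    """

    def __init__(self, n_actions, dim, alpha=1.0, forget=0.95):
        self.alpha = alpha    # width of the exploration bonus (as in LinUCB)
        self.forget = forget  # discount applied to pseudo-reward statistics
        # Per-action ridge-regression statistics, as in LinUCB:
        # A_a accumulates x x^T, b_a accumulates reward-weighted contexts.
        self.A = [np.eye(dim) for _ in range(n_actions)]
        self.b = [np.zeros(dim) for _ in range(n_actions)]
        # Separate statistics for pseudo-rewards so they can be forgotten.
        self.A_pseudo = [np.zeros((dim, dim)) for _ in range(n_actions)]
        self.b_pseudo = [np.zeros(dim) for _ in range(n_actions)]

    def _ucb(self, a, x):
        # Combine real and (discounted) pseudo statistics for action a.
        A = self.A[a] + self.A_pseudo[a]
        b = self.b[a] + self.b_pseudo[a]
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b
        # Estimated reward plus an exploration bonus (upper confidence bound).
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def select(self, x):
        # Choose the action with the highest upper confidence bound.
        scores = [self._ucb(a, x) for a in range(len(self.A))]
        return int(np.argmax(scores))

    def update(self, chosen, x, reward):
        # Real feedback updates the chosen action's statistics, as in LinUCB.
        self.A[chosen] += np.outer(x, x)
        self.b[chosen] += reward * x
        for a in range(len(self.A)):
            if a == chosen:
                continue
            # Non-selected actions receive a pseudo-reward: an optimistic
            # over-estimate of the unobserved reward (here, the action's UCB).
            pseudo = self._ucb(a, x)
            # Forgetting: discount old pseudo-reward statistics so that
            # over-estimates fade as real feedback accumulates.
            self.A_pseudo[a] = self.forget * self.A_pseudo[a] + np.outer(x, x)
            self.b_pseudo[a] = self.forget * self.b_pseudo[a] + pseudo * x
```

In this sketch, setting the forgetting factor to zero would discard pseudo-rewards immediately and recover plain LinUCB behavior, while values close to one keep the optimistic estimates around longer; the paper's "special parameter" variant for faster model updates is not modeled here.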