Linear Upper Confidence Bound Algorithm for Contextual Bandit Problem with Piled Rewards

We study the contextual bandit problem with a linear payoff function. In the traditional contextual bandit problem, the algorithm iteratively chooses an action based on the observed context and immediately receives a reward for the chosen action. Motivated by a practical need in many applications, we study the design of algorithms under the piled-reward setting, where the rewards are received in piles rather than immediately. We first show how the Linear Upper Confidence Bound (LinUCB) algorithm for the traditional problem can be naively applied under the piled-reward setting, and we prove its regret bound. We then extend LinUCB to a novel algorithm, called Linear Upper Confidence Bound with Pseudo Reward (LinUCBPR), which digests the observed contexts to choose actions more strategically before the piled rewards are received. We prove that LinUCBPR matches the regret bound of LinUCB under the piled-reward setting. Experiments on artificial and real-world datasets demonstrate the strong performance of LinUCBPR in practice.
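To make the piled-reward setting concrete, below is a minimal sketch (in Python/NumPy) of how LinUCB can be run when rewards arrive in piles: a context is observed and an action is chosen every round, but the per-action ridge-regression statistics are only updated once a pile of rewards is received. The class name, feature dimension, exploration weight alpha, and unit ridge regularization are illustrative assumptions for this sketch, not the paper's exact formulation of LinUCB or LinUCBPR.

```python
# Sketch: LinUCB under a piled-reward setting (illustrative, not the paper's code).
import numpy as np


class LinUCBPiled:
    def __init__(self, n_actions, d, alpha=1.0):
        self.alpha = alpha
        # One ridge-regression model per action: A = I + sum(x x^T), b = sum(r x).
        self.A = [np.eye(d) for _ in range(n_actions)]
        self.b = [np.zeros(d) for _ in range(n_actions)]

    def choose(self, contexts):
        # contexts[a] is the d-dimensional feature vector of action a this round.
        scores = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            # Upper confidence bound: estimated reward plus exploration bonus.
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update_pile(self, pile):
        # Rewards arrive as a pile: (action, context, reward) triples collected
        # over several past rounds, and only applied to the model now.
        for a, x, r in pile:
            self.A[a] += np.outer(x, x)
            self.b[a] += r * x
```

Under this sketch, the naive application of LinUCB simply chooses actions with stale statistics until the next pile arrives; the paper's LinUCBPR additionally exploits the contexts observed within a pile (via pseudo rewards) before the true rewards are revealed.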