An Unbiased, Data-Driven, Offline Evaluation Method of Contextual Bandit Algorithms

[1]  Peter Auer,et al.  Using Confidence Bounds for Exploitation-Exploration Trade-offs , 2003, J. Mach. Learn. Res..

[2]  Lihong Li,et al.  PAC model-free reinforcement learning , 2006, ICML.

[3]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[4]  Peter Auer,et al.  The Nonstochastic Multiarmed Bandit Problem , 2002, SIAM J. Comput..

[5]  H. Vincent Poor,et al.  Bandit problems with side observations , 2005, IEEE Transactions on Automatic Control.

[6]  Yishay Mansour,et al.  Approximate Planning in Large POMDPs via Reusable Trajectories , 1999, NIPS.

[7]  H. Robbins,et al.  Asymptotically efficient adaptive allocation rules , 1985 .

[8]  John Langford,et al.  Approximately Optimal Approximate Reinforcement Learning , 2002, ICML.

[9]  M. Woodroofe A One-Armed Bandit Problem with a Concomitant Variable , 1979 .

[10]  J. Gittins Bandit processes and dynamic allocation indices , 1979 .

[11]  Bee-Chung Chen,et al.  Explore/Exploit Schemes for Web Content Optimization , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[12]  R. Agrawal Sample mean based index policies by O(log n) regret for the multi-armed bandit problem , 1995, Advances in Applied Probability.

[13]  Brian Tanner,et al.  RL-Glue: Language-Independent Software for Reinforcement-Learning Experiments , 2009, J. Mach. Learn. Res..

[14]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[15]  J. Langford,et al.  The Epoch-Greedy algorithm for contextual multi-armed bandits , 2007, NIPS 2007.

[16]  Donald A. Berry,et al.  Bandit Problems: Sequential Allocation of Experiments. , 1986 .

[17]  Chris Mesterharm,et al.  Experience-efficient learning in associative bandit problems , 2006, ICML.

[18]  Christopher J. Merz,et al.  UCI Repository of Machine Learning Databases , 1996 .

[19]  H. Robbins Some aspects of the sequential design of experiments , 1952 .

[20]  John Langford,et al.  Exploration scavenging , 2008, ICML '08.

[21]  Leslie Pack Kaelbling,et al.  Associative Reinforcement Learning: Functions in k-DNF , 1994, Machine Learning.

[22]  Csaba Szepesvári,et al.  Bandit Based Monte-Carlo Planning , 2006, ECML.

[23]  Wei Chu,et al.  A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.

[24]  W. R. Thompson ON THE LIKELIHOOD THAT ONE UNKNOWN PROBABILITY EXCEEDS ANOTHER IN VIEW OF THE EVIDENCE OF TWO SAMPLES , 1933 .