论文信息 - The Epoch-Greedy algorithm for contextual multi-armed bandits

The Epoch-Greedy algorithm for contextual multi-armed bandits

We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon T is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as O(T2/3S1/3) or better (sometimes, much better). Here S is the complexity term in a sample complexity bound for standard supervised learning.

J. Langford | Tong Zhang

[1] J. Heckman. Sample selection bias as a specification error , 1979 .

[2] W. Greene. Sample Selection Bias as a Specification Error: Comment , 1981 .

[3] Nicolò Cesa-Bianchi,et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem , 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.

[4] S. Yakowitz,et al. Machine learning and nonparametric bandit theory , 1995, IEEE Trans. Autom. Control..

[5] Yishay Mansour,et al. Approximate Planning in Large POMDPs via Reusable Trajectories , 1999, NIPS.

[6] Peter Auer,et al. Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[7] H. Vincent Poor,et al. Bandit problems with side observations , 2005, IEEE Transactions on Automatic Control.

[8] Chris Mesterharm,et al. Experience-efficient learning in associative bandit problems , 2006, ICML.

[9] Shie Mannor,et al. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems , 2006, J. Mach. Learn. Res..

[10] Deepayan Chakrabarti,et al. Bandits for Taxonomies: A Model-based Approach , 2007, SDM.

[11] T. L. Lai Andherbertrobbins. Asymptotically Efficient Adaptive Allocation Rules , 2022 .