Paper Information - Simulation Studies of Multi-armed Bandits with Covariates (Invited Paper)


We evaluate the performance of a number of action-selection methods on the multi-armed bandit problem with covariates. We resort to simulations because our primary concern is the speed with which the different methods identify the optimal policy, and not their asymptotic behaviour. The experimental results show that the performance of the ε-greedy methods is robust, while the interval estimation strategies achieve the fastest learning of the optimal policy. We propose a metric to quantify the difficulty of a multi-armed bandit problem with covariates and show that there is a trade-off between the satisfaction of the different performance measures.
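As a rough illustration of the kind of simulation the abstract describes, the sketch below runs an ε-greedy strategy on a toy bandit with one scalar covariate. It is not the paper's experimental setup: the linear reward model, the uniform covariate distribution, the Gaussian noise, and the names `ArmModel` and `run_epsilon_greedy` are all assumptions made for this example.

```python
import random


class ArmModel:
    """Running simple linear regression of reward on one covariate (hypothetical helper)."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        # Accumulate the sufficient statistics for least squares.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def predict(self, x):
        if self.n == 0:
            return 0.0  # assumed default estimate for an arm never pulled
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or abs(denom) < 1e-12:
            return self.sy / self.n  # too little spread in x: use the sample mean
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept + slope * x


def run_epsilon_greedy(true_params, epsilon=0.1, steps=2000, noise=0.1, seed=0):
    """Simulate the bandit; return the fraction of steps on which the optimal arm was pulled.

    true_params holds one (intercept, slope) pair per arm. This linear reward
    model is an assumption of the sketch, not taken from the paper.
    """
    rng = random.Random(seed)
    models = [ArmModel() for _ in true_params]
    optimal_pulls = 0
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)  # covariate observed before choosing an arm
        expected = [a + b * x for a, b in true_params]
        best = max(range(len(expected)), key=expected.__getitem__)
        if rng.random() < epsilon:
            arm = rng.randrange(len(models))  # explore: uniformly random arm
        else:
            preds = [m.predict(x) for m in models]
            arm = max(range(len(preds)), key=preds.__getitem__)  # exploit
        reward = expected[arm] + rng.gauss(0.0, noise)
        models[arm].update(x, reward)
        optimal_pulls += arm == best
    return optimal_pulls / steps
```

Calling, for instance, `run_epsilon_greedy([(0.0, 1.0), (0.0, -1.0)])` gives the share of rounds in which the arm with the highest expected reward for the observed covariate was chosen, one simple way to measure how quickly a method finds the optimal policy. An interval estimation strategy would replace the greedy step with a choice based on an upper confidence bound of each arm's prediction.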
