A Linear Response Bandit Problem

We consider a two-armed bandit problem which involves sequential sampling from two non-homogeneous populations. The response in each is determined by a random covariate vector and a vector of parameters whose values are not known a priori. The goal is to maximize cumulative expected reward. We study this problem in a minimax setting and develop rate-optimal policies that combine myopic action based on least squares estimates with a suitable “forced sampling” strategy. It is shown that the regret grows logarithmically in the time horizon n, and that no policy can achieve a slower growth rate over all feasible problem instances. In this setting of linear response bandits, the identity of the sub-optimal action changes with the values of the covariate vector, and the optimal policy is subject to sampling from the inferior population at a rate that grows like n.
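For intuition, the sketch below shows one way such a policy can be organized: a short forced-sampling phase in which the two arms are pulled on a prescribed schedule, followed by myopic (greedy) choices based on per-arm ordinary least squares estimates of the response parameters. The interface (`draw_covariate`, `draw_reward`, `n_forced`) and the alternating forced-sampling schedule are illustrative assumptions, not the exact construction analyzed in the paper.

```python
import numpy as np

def ols_forced_sampling_policy(n, d, draw_covariate, draw_reward, n_forced=10, seed=0):
    """Toy two-armed linear-response bandit policy: a short forced-sampling
    phase, then greedy (myopic) choices based on per-arm OLS estimates."""
    rng = np.random.default_rng(seed)
    X = {0: [], 1: []}                      # covariates seen when each arm was pulled
    Y = {0: [], 1: []}                      # corresponding rewards
    beta_hat = {0: np.zeros(d), 1: np.zeros(d)}
    total_reward = 0.0
    for t in range(n):
        x = draw_covariate(rng)             # covariate revealed before the decision
        if t < 2 * n_forced:
            arm = t % 2                     # forced sampling: alternate arms early on
        else:
            # myopic step: pull the arm with the larger estimated mean response x'beta
            arm = int(x @ beta_hat[1] > x @ beta_hat[0])
        r = draw_reward(arm, x, rng)
        X[arm].append(x)
        Y[arm].append(r)
        # refit ordinary least squares for the arm just pulled
        A, b = np.asarray(X[arm]), np.asarray(Y[arm])
        beta_hat[arm] = np.linalg.lstsq(A, b, rcond=None)[0]
        total_reward += r
    return total_reward, beta_hat

# Example with linear responses x'beta_k plus Gaussian noise (parameters chosen arbitrarily):
beta = {0: np.array([1.0, -0.5]), 1: np.array([0.2, 1.0])}
reward, estimates = ols_forced_sampling_policy(
    n=2000, d=2,
    draw_covariate=lambda rng: rng.normal(size=2),
    draw_reward=lambda a, x, rng: float(x @ beta[a] + rng.normal(scale=0.1)),
)
```

The forced-sampling budget and schedule used here are simple placeholders and are not tuned to deliver the logarithmic regret guarantee discussed above.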
