A Linear Response Bandit Problem

We consider a two-armed bandit problem which involves sequential sampling from two non-homogeneous populations. The response in each is determined by a random covariate vector and a vector of parameters whose values are not known a priori. The goal is to maximize cumulative expected reward. We study this problem in a minimax setting and develop rate-optimal policies that combine myopic action based on least squares estimates with a suitable “forced sampling” strategy. It is shown that the regret grows logarithmically in the time horizon n, and that no policy can achieve a slower growth rate over all feasible problem instances. In this setting of linear response bandits, the identity of the sub-optimal action changes with the values of the covariate vector, and the optimal policy is subject to sampling from the inferior population at a rate that grows like n.
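For intuition, the sketch below shows one way such a policy can be organized: a short forced-sampling phase in which the two arms are pulled on a prescribed schedule, followed by myopic (greedy) choices based on per-arm ordinary least squares estimates of the response parameters. The interface (`draw_covariate`, `draw_reward`, `n_forced`) and the alternating forced-sampling schedule are illustrative assumptions, not the exact construction analyzed in the paper.

```python
import numpy as np

def ols_forced_sampling_policy(n, d, draw_covariate, draw_reward, n_forced=10, seed=0):
    """Toy two-armed linear-response bandit policy: a short forced-sampling
    phase, then greedy (myopic) choices based on per-arm OLS estimates."""
    rng = np.random.default_rng(seed)
    X = {0: [], 1: []}                      # covariates seen when each arm was pulled
    Y = {0: [], 1: []}                      # corresponding rewards
    beta_hat = {0: np.zeros(d), 1: np.zeros(d)}
    total_reward = 0.0
    for t in range(n):
        x = draw_covariate(rng)             # covariate revealed before the decision
        if t < 2 * n_forced:
            arm = t % 2                     # forced sampling: alternate arms early on
        else:
            # myopic step: pull the arm with the larger estimated mean response x'beta
            arm = int(x @ beta_hat[1] > x @ beta_hat[0])
        r = draw_reward(arm, x, rng)
        X[arm].append(x)
        Y[arm].append(r)
        # refit ordinary least squares for the arm just pulled
        A, b = np.asarray(X[arm]), np.asarray(Y[arm])
        beta_hat[arm] = np.linalg.lstsq(A, b, rcond=None)[0]
        total_reward += r
    return total_reward, beta_hat

# Example with linear responses x'beta_k plus Gaussian noise (parameters chosen arbitrarily):
beta = {0: np.array([1.0, -0.5]), 1: np.array([0.2, 1.0])}
reward, estimates = ols_forced_sampling_policy(
    n=2000, d=2,
    draw_covariate=lambda rng: rng.normal(size=2),
    draw_reward=lambda a, x, rng: float(x @ beta[a] + rng.normal(scale=0.1)),
)
```

The forced-sampling budget and schedule used here are simple placeholders and are not tuned to deliver the logarithmic regret guarantee discussed above.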
