The value of information in multi-armed bandits with exponentially distributed rewards

Abstract. We consider a class of multi-armed bandit problems where the reward obtained by pulling an arm is drawn from an exponential distribution whose parameter is unknown. A Bayesian model with independent gamma priors is used to represent our beliefs and uncertainty about the exponential parameters. We derive a precise expression for the marginal value of information in this problem, which allows us to create a new knowledge gradient (KG) policy for making decisions. The policy is practical and easy to implement, making a case for the value of information as a general approach to optimal learning problems with many different types of learning models.
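
The belief model described in the abstract can be illustrated with the standard gamma-exponential conjugate update. The sketch below is a minimal illustration, not the paper's KG formula: it assumes the exponential parameter is a rate lambda with a Gamma(shape, rate) prior, and the names `GammaBelief`, `update`, and `expected_reward` are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GammaBelief:
    """Belief about one arm's unknown exponential rate (assumed parameterization)."""

    shape: float  # gamma shape parameter a
    rate: float   # gamma rate parameter b

    def update(self, reward: float) -> "GammaBelief":
        """Return the posterior after observing one exponential reward.

        Conjugacy: a Gamma(a, b) prior on the rate and one observation W
        give a Gamma(a + 1, b + W) posterior.
        """
        if reward < 0:
            raise ValueError("exponential rewards are non-negative")
        return GammaBelief(self.shape + 1.0, self.rate + reward)

    def expected_reward(self) -> float:
        """Estimate of the mean reward, E[1 / lambda] = b / (a - 1), for a > 1."""
        if self.shape <= 1.0:
            raise ValueError("the expected reward is finite only for shape > 1")
        return self.rate / (self.shape - 1.0)


# Usage: independent beliefs for two arms; pull arm 0 and observe a reward of 2.0.
beliefs = [GammaBelief(3.0, 4.0), GammaBelief(2.0, 5.0)]
beliefs[0] = beliefs[0].update(2.0)
```

Because the priors are independent across arms, observing one arm changes only that arm's belief. A value-of-information policy would compare how much each possible pull is expected to improve the best `expected_reward`; that calculation is the subject of the paper and is not reproduced here.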
