Robust Multiarmed Bandit Problems

The multiarmed bandit problem is a popular framework for studying the exploration versus exploitation trade-off. Recent applications include dynamic assortment design, Internet advertising, dynamic pricing, and the control of queues. The standard mathematical formulation for a bandit problem makes the strong assumption that the decision maker has a full characterization of the joint distribution of the rewards, and that “arms” under this distribution are independent. These assumptions are not satisfied in many applications, and the out-of-sample performance of policies that optimize a misspecified model can be poor. Motivated by these concerns, we formulate a robust bandit problem in which a decision maker accounts for distrust in the nominal model by solving a worst-case problem against an adversary (“nature”) who has the ability to alter the underlying reward distribution and does so to minimize the decision maker’s expected total profit. Structural properties of the optimal worst-case policy are characterized by using the robust Bellman (dynamic programming) equation, and arms are shown to be no longer independent under nature’s worst-case response. One implication of this is that index policies are not optimal for the robust problem, and we propose, as an alternative, a robust version of the Gittins index. Performance bounds for the robust Gittins index are derived by using structural properties of the value function together with ideas from stochastic dynamic programming duality. We also investigate the performance of the robust Gittins index policy when applied to a Bayesian webpage design problem. In the presence of model misspecification, numerical experiments show that the robust Gittins index policy not only outperforms the classical Gittins index policy, but also substantially reduces the variability in the out-of-sample performance. This paper was accepted by Dimitris Bertsimas, optimization.
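
To make the worst-case formulation concrete, the sketch below gives one standard way of writing a robust Bellman equation of this type; the relative-entropy penalty, the ambiguity parameter \theta, and the risk-sensitive reduction are illustrative assumptions about the adversary model, not necessarily the exact formulation used in the paper. Here x is the decision maker's information state, a indexes the arms, r(x,a) is the nominal one-period reward, \beta is the discount factor, P is the nominal law of the next state X', and nature chooses a perturbed law Q at a relative-entropy cost \theta R(Q\|P):

\[
V(x) \;=\; \max_{a}\; \inf_{Q \ll P}\Big\{ \mathbb{E}^{Q}\big[\, r(x,a) + \beta\, V(X') \,\big] \;+\; \theta\, R(Q\,\|\,P) \Big\}
\;=\; \max_{a}\Big( -\theta \log \mathbb{E}^{P}\big[\, e^{-(r(x,a)+\beta V(X'))/\theta} \,\big] \Big),
\]

where the second equality follows from the Donsker--Varadhan (Gibbs) variational formula for the relative-entropy-penalized inner problem. Under this sketch, nature's worst-case response exponentially tilts the nominal distribution toward low-reward outcomes through the value function V, which depends on all arms jointly; this is one way to see why arms that are independent under the nominal model become coupled under nature's response and why classical index policies can lose optimality in the robust problem.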
