Exploiting the Natural Exploration in Contextual Bandits

The contextual bandit literature has traditionally focused on algorithms that address the exploration-exploitation trade-off; in particular, greedy policies that exploit current estimates without any exploration may be sub-optimal in general. However, exploration-free greedy policies are desirable in many practical settings where exploration may be prohibitively costly or unethical (e.g., clinical trials). We prove that, for a general class of context distributions, the greedy policy benefits from natural exploration induced by the variation in contexts and becomes asymptotically rate-optimal for the two-armed contextual bandit. Through simulations, we also demonstrate that these results generalize to more than two arms when the context dimension is large enough. Motivated by these results, we introduce Greedy-First, a new algorithm that uses only the observed contexts and rewards to determine whether to follow the greedy policy or to explore. We prove that this algorithm is asymptotically optimal without any additional assumptions on the context distribution or the number of arms. Extensive simulations demonstrate that Greedy-First successfully reduces experimentation and outperforms existing exploration-based contextual bandit algorithms such as Thompson sampling, UCB, and $\epsilon$-greedy.
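
To make the exploration-free policy discussed above concrete, here is a minimal Python sketch of a greedy linear contextual bandit. It is illustrative only: the linear reward model, the ridge-regularized least-squares estimates, and the names (`greedy_linear_bandit`, `pull`, `theta_true`) are assumptions made for this sketch rather than the paper's implementation, and the toy regret computation at the end is meant only to show how context diversity alone can drive learning.

```python
import numpy as np


def greedy_linear_bandit(contexts, pull, n_arms, d, ridge=1.0):
    """Exploration-free greedy policy for a linear contextual bandit.

    Each arm k is assumed to have mean reward <x, theta_k>. At every round the
    policy refits a ridge-regression estimate of theta_k for each arm from the
    data gathered so far and pulls the arm with the largest estimated reward
    for the current context; no exploration is forced, so parameter learning
    comes solely from the diversity of the observed contexts.
    """
    gram = [ridge * np.eye(d) for _ in range(n_arms)]  # regularized X^T X per arm
    xty = [np.zeros(d) for _ in range(n_arms)]         # X^T y per arm
    arms, rewards = [], []
    for x in contexts:
        theta_hat = [np.linalg.solve(gram[k], xty[k]) for k in range(n_arms)]
        k = int(np.argmax([x @ th for th in theta_hat]))  # pure exploitation
        r = pull(k, x)
        gram[k] += np.outer(x, x)
        xty[k] += r * x
        arms.append(k)
        rewards.append(r)
    return np.array(arms), np.array(rewards)


# Toy two-armed instance: diverse Gaussian contexts supply the "natural
# exploration" that lets the greedy policy recover both arms' parameters.
rng = np.random.default_rng(0)
d, n_arms, horizon = 3, 2, 5000
theta_true = rng.normal(size=(n_arms, d))
contexts = rng.normal(size=(horizon, d))
pull = lambda k, x: x @ theta_true[k] + 0.1 * rng.normal()

arms, rewards = greedy_linear_bandit(contexts, pull, n_arms, d)
oracle = np.max(contexts @ theta_true.T, axis=1)  # best achievable mean reward
regret = oracle - (contexts @ theta_true.T)[np.arange(horizon), arms]
print("cumulative regret:", regret.sum())
```

Greedy-First, as described in the abstract, would wrap such a greedy policy with a data-driven test that falls back to an exploration-based algorithm when the observed contexts do not provide enough diversity; the form of that test is specified in the paper and is not reproduced here.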
