Dynamic Batch Learning in High-Dimensional Sparse Linear Contextual Bandits

We study dynamic batch learning in high-dimensional sparse linear contextual bandits. A decision maker faces a constraint on the maximum number of batches and observes rewards only at the end of each batch. At the end of the current batch, the decision maker can dynamically decide how many individuals to include in the next batch and what personalized action-selection scheme to adopt within it. Such batch constraints are ubiquitous in practice, for example in personalized product offerings in marketing and in medical treatment selection in clinical trials. We characterize the fundamental learning limit of this problem via a regret lower bound and provide a matching upper bound (up to log factors), thus prescribing an optimal scheme. To the best of our knowledge, our work provides the first inroad into a theoretical understanding of dynamic batch learning in high-dimensional sparse linear contextual bandits. Notably, even a special case of our result (when no batch constraint is present) yields the first minimax optimal $\tilde{O}(\sqrt{s_0 T})$ regret bound for standard online learning in high-dimensional linear contextual bandits (in the no-margin case), where $s_0$ is the sparsity parameter (or an upper bound thereof) and $T$ is the learning horizon. This result (both that $\tilde{O}(\sqrt{s_0 T})$ is achievable and that $\Omega(\sqrt{s_0 T})$ is a lower bound) appears to have been previously unknown in the emerging literature on high-dimensional contextual bandits.
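The interaction protocol described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the scheme analyzed in this work: the batch-doubling rule, the greedy within-batch policy, the ISTA Lasso solver, the penalty level, and all names (`run_batched`, `lasso_ista`, `soft_threshold`) are hypothetical choices made only to show rewards arriving at batch ends and the next batch size being chosen at that point.

```python
import math
import random


def soft_threshold(x, lam):
    """Shrink x toward zero by lam (proximal map of the l1 penalty)."""
    return math.copysign(max(abs(x) - lam, 0.0), x)


def lasso_ista(X, y, lam, steps=100):
    """Plain ISTA for (1/2n)||y - X theta||^2 + lam * ||theta||_1."""
    n, d = len(X), len(X[0])
    # Trace of X^T X / n upper-bounds its largest eigenvalue, so 1/lip is a safe step.
    lip = sum(v * v for row in X for v in row) / n
    theta = [0.0] * d
    for _ in range(steps):
        resid = [sum(a * t for a, t in zip(row, theta)) - yi for row, yi in zip(X, y)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(d)]
        theta = [soft_threshold(t - g / lip, lam / lip) for t, g in zip(theta, grad)]
    return theta


def run_batched(T=400, M=4, d=20, K=3, s0=2, noise=0.1, seed=0):
    """Simulate at most M batches over horizon T; return (batch sizes, regret)."""
    rng = random.Random(seed)
    theta_star = [1.0] * s0 + [0.0] * (d - s0)  # s0-sparse true parameter (assumed)
    theta_hat = [0.0] * d                        # estimate, frozen within a batch
    X_hist, y_hist, batch_sizes, regret = [], [], [], 0.0
    remaining, size = T, max(1, T // 2 ** M)     # hypothetical first batch size
    for m in range(M):
        if remaining == 0:
            break
        # The last allowed batch must absorb everything that is left.
        n_m = remaining if m == M - 1 else min(remaining, size)
        X_new, y_new = [], []
        for _ in range(n_m):
            arms = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
            means = [sum(a * t for a, t in zip(arm, theta_star)) for arm in arms]
            scores = [sum(a * t for a, t in zip(arm, theta_hat)) for arm in arms]
            # Greedy on the frozen estimate; ties (e.g. first batch) broken at random.
            k = max(range(K), key=lambda i: (scores[i], rng.random()))
            regret += max(means) - means[k]
            X_new.append(arms[k])
            y_new.append(means[k] + rng.gauss(0, noise))
        # Rewards of the whole batch become visible only here, at the batch end.
        X_hist += X_new
        y_hist += y_new
        batch_sizes.append(n_m)
        remaining -= n_m
        if remaining:
            lam = noise * math.sqrt(2 * math.log(d) / len(y_hist))  # illustrative penalty
            theta_hat = lasso_ista(X_hist, y_hist, lam)
            size = 2 * n_m  # next batch size decided now (assumed doubling rule)
    return batch_sizes, regret


if __name__ == "__main__":
    sizes, total_regret = run_batched()
    print("batch sizes:", sizes)
    print("cumulative regret:", round(total_regret, 2))
```

The estimate `theta_hat` is deliberately updated only between batches, which is the one feature of the setting this sketch is meant to make concrete. Replacing the doubling rule with any other function of the data seen so far keeps the protocol intact.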
