Bandit Algorithms for Precision Medicine

The Oxford English Dictionary defines precision medicine as “medical care designed to optimize efficiency or therapeutic benefit for particular groups of patients, especially by using genetic or molecular profiling.” It is not an entirely new idea: physicians since ancient times have recognized that medical treatment needs to account for individual variation in patient characteristics (Konstantinidou et al., 2017). The modern precision medicine movement, however, has been enabled by a confluence of developments: scientific advances in fields such as genetics and pharmacology, technological advances in mobile devices and wearable sensors, and methodological advances in computing and the data sciences.

This chapter is about bandit algorithms, an area of data science of special relevance to precision medicine. With their roots in the seminal work of Bellman, Robbins, Lai, and others, bandit algorithms have come to occupy a central place in modern data science (see the book by Lattimore and Szepesvári (2020) for an up-to-date treatment). Bandit algorithms can be used in any situation where treatment decisions need to be made sequentially to optimize some health outcome. Since precision medicine focuses on the use of patient characteristics to guide treatment, contextual bandit algorithms are especially useful because they are designed to take such information into account. The role of bandit algorithms in areas of precision medicine such as mobile health and digital phenotyping has been reviewed before (Tewari and Murphy, 2017; Rabbi et al., 2019). Since those reviews were published, bandit algorithms have continued to find uses in mobile health, and several new topics have emerged in the research on bandit algorithms.

This chapter is written for quantitative researchers in fields such as statistics, machine learning, and operations research who might be interested in knowing more about the algorithmic and mathematical details of bandit algorithms that have been used in mobile health. We have organized this chapter to meet two goals. First, we want to provide a concise exposition of basic topics in bandit algorithms. Section 2 will help the reader become familiar with basic problem setups and algorithms that appear frequently in applied work in precision medicine and mobile health (see, for example, Paredes et al. (2014); Piette et al. (2015); Rabbi et al. (2015); Piette et al. (2016); Yom-Tov et al. (2017); Rindtorff et al. (2019); Forman et al. (2019); Liao et al. (2020); Ameko et al. (2020); Aguilera et al. (2020); Tomkins et al. (2021)). Second, we want to highlight a few advanced topics that are important for mobile health and precision medicine applications but whose full potential remains to be realized. Section 3 will provide the reader with helpful entry points into the bandit literature on non-stationarity, robustness to corrupted rewards, satisfying additional constraints, algorithmic fairness, and causality.
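To make the contextual bandit setup concrete, the following minimal sketch (Python, using only NumPy) runs linear Thompson sampling, a standard algorithm in this literature (Thompson, 1933; Agrawal and Goyal, 2012), on a simulated decision problem: at each decision point the algorithm observes a context vector (for example, standardized patient features), chooses one of two hypothetical treatments, and observes a noisy outcome. The simulated environment, the feature dimension, and the coefficient values below are invented solely for illustration and are not taken from the chapter or from any study.

import numpy as np

rng = np.random.default_rng(0)

class LinearThompsonSampling:
    """Per-arm Bayesian linear regression with a Gaussian prior.

    At each decision point, a coefficient vector is sampled from every
    arm's posterior, the current context is scored under each sample,
    and the arm with the highest sampled score is played.
    """

    def __init__(self, n_arms, dim, prior_var=1.0, noise_var=1.0):
        self.noise_var = noise_var
        # Posterior precision matrix (A) and precision-weighted mean (b) per arm.
        self.A = [np.eye(dim) / prior_var for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select_arm(self, context):
        scores = []
        for A, b in zip(self.A, self.b):
            cov = np.linalg.inv(A)          # posterior covariance
            mean = cov @ b                  # posterior mean
            theta = rng.multivariate_normal(mean, cov)
            scores.append(context @ theta)
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        self.A[arm] += np.outer(context, context) / self.noise_var
        self.b[arm] += reward * context / self.noise_var

# Hypothetical simulation: two treatments whose effects depend on the context.
# The coefficients below are made up purely to exercise the code.
true_theta = np.array([[0.5, -0.4], [-0.3, 0.6]])
policy = LinearThompsonSampling(n_arms=2, dim=2)
total_reward = 0.0
for t in range(1000):
    context = rng.normal(size=2)                  # e.g., standardized patient features
    arm = policy.select_arm(context)              # treatment decision
    reward = context @ true_theta[arm] + rng.normal(scale=0.5)  # observed outcome
    policy.update(arm, context, reward)
    total_reward += reward
print(f"average reward over 1000 decisions: {total_reward / 1000:.3f}")

The same interaction loop (observe a context, select an action, observe a reward, update the model) underlies the mobile health applications cited above; in practice the reward would be a measured health outcome, such as a step count or a self-reported stress rating, rather than a simulated draw.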

[1]  John Langford,et al.  Doubly Robust Policy Evaluation and Learning , 2011, ICML.

[2]  Y. Narahari,et al.  Achieving Fairness in the Stochastic Multi-armed Bandit Problem , 2019, AAAI.

[3]  Alessandro Lazaric,et al.  Improved Algorithms for Conservative Exploration in Bandits , 2020, AAAI.

[4]  B. Chakraborty,et al.  mHealth app using machine learning to increase physical activity in diabetes and depression: clinical trial protocol for the DIAMANTE Study , 2020, BMJ Open.

[5]  John Langford,et al.  The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information , 2007, NIPS.

[6]  Wei Chu,et al.  Contextual Bandits with Linear Payoff Functions , 2011, AISTATS.

[7]  M. Panagopoulou,et al.  Are the Origins of Precision Medicine Found in the Corpus Hippocraticum? , 2017, Molecular Diagnosis & Therapy.

[8]  Shipra Agrawal,et al.  Further Optimal Regret Bounds for Thompson Sampling , 2012, AISTATS.

[9]  Shipra Agrawal,et al.  Analysis of Thompson Sampling for the Multi-armed Bandit Problem , 2011, COLT.

[10]  Aaron Roth,et al.  Fairness in Learning: Classic and Contextual Bandits , 2016, NIPS.

[11]  Xintao Wu,et al.  Achieving User-Side Fairness in Contextual Bandits , 2020, Human-Centric Intelligent Systems.

[12]  Csaba Szepesvári,et al.  Improved Algorithms for Linear Stochastic Bandits , 2011, NIPS.

[13]  Haipeng Luo,et al.  A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free , 2019, COLT.

[14]  Thorsten Joachims,et al.  Fairness of Exposure in Stochastic Bandits , 2021, ICML.

[15]  Nenghai Yu,et al.  Thompson Sampling for Budgeted Multi-Armed Bandits , 2015, IJCAI.

[16]  Peter Auer,et al.  Using Confidence Bounds for Exploitation-Exploration Trade-offs , 2003, J. Mach. Learn. Res..

[17]  Alicia R. Martin,et al.  Clinical use of current polygenic risk scores may exacerbate health disparities , 2019, Nature Genetics.

[18]  Archie C. Chapman,et al.  Epsilon-First Policies for Budget-Limited Multi-Armed Bandits , 2010, AAAI.

[19]  Elias Bareinboim,et al.  Bandits with Unobserved Confounders: A Causal Approach , 2015, NIPS.

[20]  Claudio Gentile,et al.  A Gang of Bandits , 2013, NIPS.

[21]  R. Srikant,et al.  Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits , 2015, NIPS.

[22]  Zhuoran Yang,et al.  Is Pessimism Provably Efficient for Offline RL? , 2020, ICML.

[23]  Matt J. Kusner,et al.  Counterfactual Fairness , 2017, NIPS.

[24]  Ashish Kapoor,et al.  Safety-Aware Algorithms for Adversarial Contextual Bandit , 2017, ICML.

[25]  Ürün Dogan,et al.  Multi-Task Learning for Contextual Bandits , 2017, NIPS.

[26]  Olivier Nicol,et al.  Improving offline evaluation of contextual bandit algorithms via bootstrapping techniques , 2014, ICML.

[27]  Claire J. Tomlin,et al.  Budget-Constrained Multi-Armed Bandits with Multiple Plays , 2017, AAAI.

[28]  Tor Lattimore,et al.  Optimally Confident UCB : Improved Regret for Finite-Armed Bandits , 2015, ArXiv.

[29]  S. Murphy,et al.  A "SMART" design for building individualized treatment sequences. , 2012, Annual review of clinical psychology.

[30]  J. Paulus,et al.  Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities , 2020, npj Digital Medicine.

[31]  Ambuj Tewari,et al.  Causal Bandits with Unknown Graph Structure , 2021, NeurIPS.

[32]  Martin J. Wainwright,et al.  Minimax Off-Policy Evaluation for Multi-Armed Bandits , 2021, IEEE Transactions on Information Theory.

[33]  Shuai Li,et al.  Collaborative Filtering Bandits , 2015, SIGIR.

[34]  Christopher Jung,et al.  Online Learning with an Unknown Fairness Metric , 2018, NeurIPS.

[35]  Joaquin Quiñonero Candela,et al.  Counterfactual reasoning and learning systems: the example of computational advertising , 2013, J. Mach. Learn. Res..

[36]  Miroslav Dudík,et al.  Optimal and Adaptive Off-policy Evaluation in Contextual Bandits , 2016, ICML.

[37]  Assaf J. Zeevi,et al.  A Note on Performance Limitations in Bandit Problems With Side Information , 2011, IEEE Transactions on Information Theory.

[38]  Nicolò Cesa-Bianchi,et al.  Combinatorial Bandits , 2012, COLT.

[39]  Nikhil R. Devanur,et al.  Bandits with concave rewards and convex knapsacks , 2014, EC.

[40]  Yang Liu,et al.  Calibrated Fairness in Bandits , 2017, ArXiv.

[41]  Shuai Li,et al.  Online Clustering of Bandits , 2014, ICML.

[42]  John Langford,et al.  Doubly Robust Policy Evaluation and Optimization , 2014, ArXiv.

[43]  John N. Tsitsiklis,et al.  Linearly Parameterized Bandits , 2008, Math. Oper. Res..

[44]  Alexander D'Amour,et al.  A Biologically Plausible Benchmark for Contextual Bandit Algorithms in Precision Oncology Using in vitro Data , 2019, ArXiv.

[45]  Sanjeev R. Kulkarni,et al.  Arbitrary side observations in bandit problems , 2005, Adv. Appl. Math..

[46]  Shie Mannor,et al.  Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems , 2006, J. Mach. Learn. Res..

[47]  Gergely Neu,et al.  Explore no more: Improved high-probability regret bounds for non-stochastic bandits , 2015, NIPS.

[48]  Yifan Wu,et al.  Conservative Bandits , 2016, ICML.

[49]  Tor Lattimore,et al.  Causal Bandits: Learning Good Interventions via Causal Inference , 2016, NIPS.

[50]  Mehdi Boukhechba,et al.  Offline Contextual Multi-armed Bandits for Mobile Health Interventions: A Case Study on Emotion Regulation , 2020, RecSys.

[51]  Adel Javanmard,et al.  Confidence intervals and hypothesis testing for high-dimensional regression , 2013, J. Mach. Learn. Res..

[52]  Susan A. Murphy,et al.  Statistical Inference with M-Estimators on Bandit Data , 2021, ArXiv.

[53]  Peter Auer,et al.  The Nonstochastic Multiarmed Bandit Problem , 2002, SIAM J. Comput..

[54]  Mi Zhang,et al.  MyBehavior: automatic personalized health feedback from user behaviors and preferences using smartphones , 2015, UbiComp.

[55]  Yuhong Yang,et al.  Randomized Allocation with Nonparametric Estimation for a Multi-armed Bandit Problem with Covariates , 2002 .

[56]  Vineet Nair,et al.  Budgeted and Non-budgeted Causal Bandits , 2020, AISTATS.

[57]  Tao Qin,et al.  Multi-Armed Bandit with Budget Constraint and Variable Costs , 2013, AAAI.

[58]  Anupam Gupta,et al.  Better Algorithms for Stochastic Bandits with Adversarial Corruptions , 2019, COLT.

[59]  Haipeng Luo,et al.  Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach , 2021, COLT.

[60]  Purushottam Kar,et al.  Corruption-tolerant bandit learning , 2018, Machine Learning.

[61]  Xiaokui Xiao,et al.  MOTS: Minimax Optimal Thompson Sampling , 2020, ArXiv.

[62]  Lihong Li,et al.  Toward Minimax Off-policy Value Estimation , 2015, AISTATS.

[63]  Brian W. Powers,et al.  Dissecting racial bias in an algorithm used to manage the health of populations , 2019, Science.

[64]  Renato Paes Leme,et al.  Stochastic bandits robust to adversarial corruptions , 2018, STOC.

[65]  Thorsten Joachims,et al.  Fairness of Exposure in Rankings , 2018, KDD.

[66]  Ran Gilad-Bachrach,et al.  PopTherapy: coping with stress through pop-culture , 2014, PervasiveHealth.

[67]  Karen B. Farris,et al.  The Potential Impact of Intelligent Systems for Mobile Health Self-Management Support: Monte Carlo Simulations of Text Message Support for Medication Adherence , 2014, Annals of behavioral medicine : a publication of the Society of Behavioral Medicine.

[68]  A. Zeevi,et al.  A Linear Response Bandit Problem , 2013 .

[69]  Mohsen Bayati,et al.  Online Decision-Making with High-Dimensional Covariates , 2015 .

[70]  Sarah L Krein,et al.  Patient-Centered Pain Care Using Artificial Intelligence and Mobile Health Tools: Protocol for a Randomized Study Funded by the US Department of Veterans Affairs Health Services Research and Development Program , 2016, JMIR research protocols.

[71]  Ambuj Tewari,et al.  Low-Rank Generalized Linear Bandit Problems , 2020, AISTATS.

[72]  Tor Lattimore,et al.  Refining the Confidence Level for Optimistic Bandit Strategies , 2018, J. Mach. Learn. Res..

[73]  Wonyoung Kim,et al.  Doubly Robust Thompson Sampling for linear payoffs , 2021, ArXiv.

[74]  Ambuj Tewari,et al.  From Ads to Interventions: Contextual Bandits in Mobile Health , 2017, Mobile Health - Sensors, Analytic Methods, and Applications.

[75]  Yu Zhang,et al.  A Survey on Multi-Task Learning , 2017, IEEE Transactions on Knowledge and Data Engineering.

[76]  Peter Auer,et al.  Adaptively Tracking the Best Bandit Arm with an Unknown Number of Distribution Changes , 2019, COLT.

[77]  Michael Matthews,et al.  The Alignment Problem: Machine Learning and Human Values , 2022, Personnel Psychology.

[78]  Jean-Yves Audibert,et al.  Minimax Policies for Adversarial and Stochastic Bandits , 2009, COLT.

[79]  Elias Bareinboim,et al.  Structural Causal Bandits with Non-Manipulable Variables , 2019, AAAI.

[80]  Wei Chu,et al.  A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.

[81]  Zheng Wen,et al.  Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit , 2018, AISTATS.

[82]  Eric Moulines,et al.  On Upper-Confidence Bound Policies for Switching Bandit Problems , 2011, ALT.

[83]  Csaba Szepesvári,et al.  Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits , 2012, AISTATS.

[84]  Nan Jiang,et al.  Doubly Robust Off-policy Value Evaluation for Reinforcement Learning , 2015, ICML.

[85]  Csaba Szepesvári,et al.  Bandit Algorithms , 2020 .

[86]  W. R. Thompson  On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples , 1933, Biometrika.

[87]  Alexandros G. Dimakis,et al.  Identifying Best Interventions through Online Importance Sampling , 2017, ICML.

[88]  Alessandro Lazaric,et al.  Conservative Exploration in Reinforcement Learning , 2020, AISTATS.

[89]  Yasin Abbasi-Yadkori,et al.  The Elliptical Potential Lemma Revisited , 2020, ArXiv.

[90]  Nicholas Mattei,et al.  Group Fairness in Bandit Arm Selection , 2019, ArXiv.

[91]  David Haussler,et al.  The Probably Approximately Correct (PAC) and Other Learning Models , 1993 .

[92]  Nicole Immorlica,et al.  Adversarial Bandits with Knapsacks , 2018, 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS).

[93]  Predrag Klasnja,et al.  IntelligentPooling: practical Thompson sampling for mHealth , 2021, Mach. Learn..

[94]  Wei Chu,et al.  Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms , 2010, WSDM '11.

[95]  Benjamin Van Roy,et al.  Conservative Contextual Linear Bandits , 2016, NIPS.

[96]  Rémi Munos,et al.  Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis , 2012, ALT.

[97]  Ambuj Tewari,et al.  Optimizing mHealth Interventions with a Bandit , 2019, Studies in Neuroscience, Psychology and Behavioral Economics.

[98]  Tor Lattimore,et al.  High-Dimensional Sparse Linear Bandits , 2020, NeurIPS.

[99]  Xingzhi Sun,et al.  Reinforcement Learning for Clinical Decision Support in Critical Care: Comprehensive Review , 2020, Journal of medical Internet research.

[100]  M. Clayton  Covariate Models for Bernoulli Bandits , 1989 .

[101]  J. Sarkar One-Armed Bandit Problems with Covariates , 1991 .

[102]  Elias Bareinboim,et al.  Structural Causal Bandits: Where to Intervene? , 2018, NeurIPS.

[103]  Kristjan H. Greenewald,et al.  Personalized HeartSteps , 2019, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol..

[104]  Sébastien Gerchinovitz,et al.  Sparsity Regret Bounds for Individual Sequences in Online Linear Regression , 2011, COLT.

[105]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[106]  Cun-Hui Zhang,et al.  Adaptive Lasso for sparse high-dimensional regression models , 2008 .

[107]  Peter Auer,et al.  UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem , 2010, Period. Math. Hung..

[108]  Rémi Munos,et al.  Pure Exploration in Multi-armed Bandits Problems , 2009, ALT.

[109]  Moshe Tennenholtz,et al.  Encouraging Physical Activity in Patients With Diabetes: Intervention Using a Reinforcement Learning System , 2017, Journal of medical Internet research.

[110]  Aleksandrs Slivkins,et al.  Bandits with Knapsacks , 2013, 2013 IEEE 54th Annual Symposium on Foundations of Computer Science.

[111]  Massimiliano Pontil,et al.  The Benefit of Multitask Representation Learning , 2015, J. Mach. Learn. Res..

[112]  Archie C. Chapman,et al.  Knapsack Based Optimal Policies for Budget-Limited Multi-Armed Bandits , 2012, AAAI.

[113]  Omar Besbes,et al.  Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards , 2014, NIPS.

[114]  Thomas P. Hayes,et al.  Stochastic Linear Optimization under Bandit Feedback , 2008, COLT.

[115]  Nikhil R. Devanur,et al.  Linear Contextual Bandits with Knapsacks , 2015, NIPS.

[116]  J. Robins,et al.  Doubly Robust Estimation in Missing Data and Causal Inference Models , 2005, Biometrics.

[117]  Ambuj Tewari,et al.  Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. , 2015, Health psychology : official journal of the Division of Health Psychology, American Psychological Association.

[118]  T. L. Lai, Herbert Robbins  Asymptotically Efficient Adaptive Allocation Rules , 1985, Adv. Appl. Math..

[119]  Santiago Ontañón,et al.  Can the artificial intelligence technique of reinforcement learning use continuously-monitored digital data to optimize treatment for weight loss? , 2018, Journal of Behavioral Medicine.

[120]  Ambuj Tewari,et al.  Regret Analysis of Bandit Problems with Causal Background Knowledge , 2019, UAI.

[121]  H. Mamani,et al.  How Do Tumor Cytogenetics Inform Cancer Treatments? Dynamic Risk Stratification and Precision Medicine Using Multi-armed Bandits , 2019, SSRN Electronic Journal.

[122]  Sonia Jain,et al.  A Bayesian‐bandit adaptive design for N‐of‐1 clinical trials , 2021, Statistics in medicine.