Rarely-switching linear bandits: optimization of causal effects for the real world

In many real-world scenarios, frequently changing policies is difficult, unethical, or expensive: medical guidelines, tax codes, and price lists can only be reprinted so often. We may therefore want to change a policy only when the change is probably beneficial. When a policy is a threshold on contextual variables, we can estimate treatment effects for the population lying at the threshold. This enables a schedule of incremental policy updates that optimizes the policy while making few detrimental changes. Building on this idea and the theory of linear contextual bandits, we present a conservative policy-updating procedure that changes a deterministic policy only when the change is justified. We extend the theory of linear bandits to this rarely-switching setting, proving that such procedures incur the same regret, up to a constant factor, as the standard LinUCB algorithm, while making far fewer changes to the policy and, of those changes, fewer detrimental ones. Simulations and an analysis of an infant health and well-being causal-inference dataset show that the algorithm efficiently learns a good policy with few changes. Our approach efficiently solves problems where changes are to be avoided, with potential applications in medicine, economics, and beyond.
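To make the "change only when justified" idea concrete, the sketch below shows one way a LinUCB-style learner could be made rarely-switching: it keeps a deterministic per-context policy and replaces the incumbent action only when a challenger's lower confidence bound exceeds the incumbent's upper confidence bound. This is a minimal illustration under assumed names and a simplified switching rule, not the paper's exact procedure or its regret-optimal schedule.

```python
# Hedged sketch of a rarely-switching LinUCB-style learner.
# The class name, the `beta` confidence multiplier, and the LCB-vs-UCB switching
# test are illustrative assumptions, not the paper's specified algorithm.
import numpy as np


class RarelySwitchingLinUCB:
    def __init__(self, dim, n_actions, reg=1.0, beta=1.0):
        self.A = reg * np.eye(dim)   # regularized Gram matrix of played features
        self.b = np.zeros(dim)       # accumulated reward-weighted features
        self.beta = beta             # confidence-width multiplier
        self.n_actions = n_actions
        self.policy = {}             # deterministic map: context key -> action

    def _bounds(self, x):
        """Per-action lower/upper confidence bounds for feature matrix x (n_actions, dim)."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        mean = np.array([x[a] @ theta for a in range(self.n_actions)])
        width = np.array([self.beta * np.sqrt(x[a] @ A_inv @ x[a])
                          for a in range(self.n_actions)])
        return mean - width, mean + width

    def act(self, key, x):
        """Return the action for context `key`; switch the stored policy only when justified."""
        lcb, ucb = self._bounds(x)
        incumbent = self.policy.get(key, int(np.argmax(ucb)))  # first visit: optimistic choice
        challenger = int(np.argmax(lcb))
        # Conservative switch: change the policy only when the challenger is
        # confidently better than the incumbent action.
        if lcb[challenger] > ucb[incumbent]:
            self.policy[key] = challenger
        else:
            self.policy[key] = incumbent
        return self.policy[key]

    def update(self, x_chosen, reward):
        """Standard ridge-regression bandit update with the played feature vector."""
        self.A += np.outer(x_chosen, x_chosen)
        self.b += reward * x_chosen
```

The design choice to illustrate is the asymmetry: exploration still happens through the confidence ellipsoid shrinking as data accrue, but the deployed deterministic policy moves only when the evidence clears a confidence gap, so most rounds reuse the current policy unchanged.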
