Near-Optimal Reinforcement Learning in Dynamic Treatment Regimes

A dynamic treatment regime (DTR) consists of a sequence of decision rules, one per stage of intervention, each dictating how to assign treatment to a patient based on the evolving history of treatments and covariates. Such regimes are particularly effective for managing chronic disorders and are arguably a key step toward more personalized decision-making. In this paper, we investigate the online reinforcement learning (RL) problem of selecting an optimal DTR when observational data is available. We develop the first adaptive algorithm that achieves near-optimal regret for DTRs in online settings, without any access to historical data. We then derive informative bounds on the system dynamics of the underlying DTR from confounded observational data. Finally, we combine these results into a novel RL algorithm that efficiently learns the optimal DTR while leveraging the abundant, yet imperfect, confounded observations.
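To make the "bounds-then-explore" idea concrete, below is a minimal one-stage sketch, not the paper's algorithm: a standard UCB rule whose confidence intervals are intersected with causal bounds l[a], h[a] on each treatment's interventional mean reward, assumed to have been computed offline from the confounded observational data (e.g., in the spirit of Manski-style natural bounds). The function name causal_ucb, the callback pull, and all numbers are illustrative assumptions.

```python
import numpy as np

def causal_ucb(n_arms, horizon, l, h, pull):
    """UCB loop whose optimistic estimates are clipped into causal bounds.

    l, h : arrays with l[a] <= E[reward | do(a)] <= h[a], assumed to be
           derived offline from confounded observational data.
    pull : callback pull(a) -> reward in [0, 1].
    """
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    total = 0.0
    for t in range(1, horizon + 1):
        ucb = np.empty(n_arms)
        for a in range(n_arms):
            if counts[a] == 0:
                # No online samples yet: the causal upper bound is the
                # tightest optimistic value available.
                ucb[a] = h[a]
            else:
                mean = sums[a] / counts[a]
                bonus = np.sqrt(2.0 * np.log(t) / counts[a])
                # Intersect the Hoeffding interval with the causal bounds,
                # so arms ruled out by the observational data are pruned.
                ucb[a] = min(max(mean + bonus, l[a]), h[a])
        a = int(np.argmax(ucb))
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        total += r
    return total / horizon

# Toy usage with three hypothetical treatments; the observational bounds
# already rule out arm 0 (h[0] < l[2]), so it is never pulled online.
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.6])
l = np.array([0.10, 0.40, 0.50])
h = np.array([0.35, 0.70, 0.90])
avg = causal_ucb(3, 2000, l, h, lambda a: rng.binomial(1, true_means[a]))
print("average reward:", avg)
```

The design point the sketch illustrates is that the observational bounds act purely as a filter on the optimism term: they can only shrink the confidence intervals, so the usual UCB regret analysis goes through, while arms whose causal upper bound falls below another arm's causal lower bound are eliminated without spending any online samples.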
