Learning When-to-Treat Policies

Many applied decision-making problems have a dynamic component: the policymaker needs not only to choose whom to treat, but also when to start which treatment. For example, a medical doctor may see a patient many times and, at each visit, must choose between postponing treatment (watchful waiting) and prescribing one of several available treatments. We develop an "advantage doubly robust" estimator for learning such dynamic treatment rules from observational data under the assumption of sequential ignorability. We prove welfare regret bounds that generalize results for doubly robust learning in the single-step setting, and demonstrate promising empirical performance in several different contexts. Our approach is practical for policy optimization and does not require any structural (e.g., Markovian) assumptions.
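
The abstract does not spell out the "advantage doubly robust" construction, so it is not reproduced here. As a point of reference for the single-step baseline that the regret bounds generalize, the following is a minimal sketch of the standard augmented inverse-propensity weighted (AIPW) policy-value estimate. The interface and names (dr_policy_value, e_hat, mu0_hat, mu1_hat) are illustrative assumptions, not the paper's implementation, and in practice the nuisance estimates would typically be cross-fitted.

import numpy as np

def dr_policy_value(pi, X, W, Y, e_hat, mu0_hat, mu1_hat):
    """Doubly robust (AIPW) estimate of the value of a one-step policy pi.

    pi      : function mapping covariates X to treatment decisions in {0, 1}
    X       : (n, d) covariate matrix
    W       : (n,) observed binary treatments
    Y       : (n,) observed outcomes
    e_hat   : (n,) estimated propensities P(W = 1 | X)
    mu0_hat : (n,) estimated outcome regressions E[Y | X, W = 0]
    mu1_hat : (n,) estimated outcome regressions E[Y | X, W = 1]
    """
    pi_x = pi(X).astype(float)                       # policy's treatment choices
    mu_pi = pi_x * mu1_hat + (1.0 - pi_x) * mu0_hat  # regression (plug-in) part
    # Inverse-propensity correction, active only on observations where the
    # policy's action matches the action actually taken in the data.
    match = (pi_x == W).astype(float)
    ipw = np.where(W == 1, 1.0 / e_hat, 1.0 / (1.0 - e_hat))
    mu_obs = np.where(W == 1, mu1_hat, mu0_hat)
    scores = mu_pi + match * ipw * (Y - mu_obs)
    # The estimate is consistent if either the propensity model or the
    # outcome regressions are consistent (double robustness).
    return scores.mean()

Policy learning then amounts to maximizing this estimated value over a candidate policy class; the when-to-treat setting extends this idea from a single treatment decision to a sequence of stopping decisions.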
