Batch Inverse Reinforcement Learning Using Counterfactuals for Understanding Decision Making

A key challenge in modeling real-world decision-making is that active experimentation is often impossible (e.g. in healthcare). The goal of batch inverse reinforcement learning is to recover and understand policies on the basis of demonstrated behaviour, that is, trajectories of observations and actions made by an expert maximizing some unknown reward function. We propose incorporating counterfactual reasoning into modeling decision behaviours in this setting. At each decision point, counterfactuals answer the question: Given the current history of observations, what would happen if we took a particular action? First, this offers a principled approach to learning inherently interpretable reward functions, which enables understanding the cost-benefit tradeoffs associated with an expert's actions. Second, by estimating the effects of different actions, counterfactuals readily tackle the off-policy nature of policy evaluation in the batch setting. Not only does this alleviate the cold-start problem typical of conventional solutions, but it also accommodates settings where expert policies depend on histories of observations rather than just current states. Through experiments in both real and simulated medical environments, we illustrate the effectiveness of our batch, counterfactual inverse reinforcement learning approach in recovering accurate and interpretable descriptions of expert behaviour.
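To make the idea concrete, below is a minimal sketch in Python of what counterfactual-driven batch IRL can look like. Everything here is an illustrative assumption rather than the paper's actual method: `CounterfactualModel` stands in for a learned model of action effects given an observation history, the reward is taken to be linear in predicted outcome features so that its weights read directly as cost-benefit trade-offs, and `fit_theta` uses a simple perceptron-style apprenticeship update in place of the authors' algorithm.

```python
import numpy as np

class CounterfactualModel:
    """Stand-in (hypothetical) for a learned model that answers: given the
    history of observations so far, what outcome would each action produce?"""
    def __init__(self, n_actions, dim):
        rng = np.random.default_rng(0)
        # Toy action-specific dynamics; a real model would be learned from data.
        self.W = rng.normal(size=(n_actions, dim, dim))

    def predict(self, history, action):
        # Summarise the history (here: mean of observations) and roll it
        # through the action-specific toy dynamics.
        h = np.mean(history, axis=0)
        return self.W[action] @ h

def reward(phi, theta):
    """Inherently interpretable reward: linear in outcome features, so each
    weight in `theta` reads as a cost/benefit trade-off."""
    return float(theta @ phi)

def greedy_policy(model, history, theta, n_actions):
    """At each decision point, score the counterfactual outcome of every
    candidate action under the current reward estimate and act greedily."""
    scores = [reward(model.predict(history, a), theta) for a in range(n_actions)]
    return int(np.argmax(scores))

def fit_theta(model, trajectories, n_actions, dim, lr=0.1, iters=50):
    """Batch IRL by feature matching: adjust theta so the expert's chosen
    actions score at least as well as the counterfactual alternatives.
    Each trajectory is a list of (observation_history, expert_action) pairs."""
    theta = np.zeros(dim)
    for _ in range(iters):
        grad = np.zeros(dim)
        for traj in trajectories:
            for history, a_expert in traj:
                phi_expert = model.predict(history, a_expert)
                a_hat = greedy_policy(model, history, theta, n_actions)
                phi_hat = model.predict(history, a_hat)
                # Push expert features up, current greedy features down
                # (a perceptron-style apprenticeship update).
                grad += phi_expert - phi_hat
        theta += lr * grad / max(1, len(trajectories))
    return theta

# Toy usage: two demonstrations of length 3 in a 4-dim observation space.
rng = np.random.default_rng(1)
model = CounterfactualModel(n_actions=3, dim=4)
trajs = [[(rng.normal(size=(t + 1, 4)), int(rng.integers(3))) for t in range(3)]
         for _ in range(2)]
theta = fit_theta(model, trajs, n_actions=3, dim=4)
```

The point the sketch illustrates is that, because `greedy_policy` scores actions through counterfactual predictions rather than on-policy rollouts, the entire loop runs on the fixed batch of expert trajectories, conditions on full observation histories rather than single states, and requires no active experimentation.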
