Srivatsan Srinivasan | Donghun Lee | Finale Doshi-Velez
[1] Matthieu Geist, et al. Batch, Off-Policy and Model-Free Apprenticeship Learning, 2011, EWRL.
[2] Wolfram Burgard, et al. Inverse Reinforcement Learning with Simultaneous Estimation of Rewards and Dynamics, 2016, AISTATS.
[3] Sergey Levine, et al. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning, 2017, ICLR.
[4] Lawrence Carin, et al. Linear Feature Encoding for Reinforcement Learning, 2016, NIPS.
[5] Laurent Orseau, et al. AI Safety Gridworlds, 2017, ArXiv.
[6] Philip S. Thomas, et al. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning, 2016, ICML.
[7] Matthieu Geist, et al. Inverse Reinforcement Learning through Structured Classification, 2012, NIPS.
[8] Sergey Levine, et al. Feature Construction for Inverse Reinforcement Learning, 2010, NIPS.
[9] Matthieu Geist, et al. Boosted and reward-regularized classification for apprenticeship learning, 2014, AAMAS.
[10] Matthieu Geist, et al. Learning from Demonstrations: Is It Worth Estimating a Reward Function?, 2013, ECML/PKDD.
[11] Peter Szolovits, et al. MIMIC-III, a freely accessible critical care database, 2016, Scientific Data.
[12] Peter Szolovits, et al. Deep Reinforcement Learning for Sepsis Treatment, 2017, ArXiv.
[13] Pieter Abbeel, et al. Benchmarking Deep Reinforcement Learning for Continuous Control, 2016, ICML.
[14] Tom Schaul, et al. Successor Features for Transfer in Reinforcement Learning, 2016, NIPS.
[15] Geoffrey J. Gordon, et al. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, 2010, AISTATS.
[16] Honglak Lee, et al. Action-Conditional Video Prediction using Deep Networks in Atari Games, 2015, NIPS.
[17] David Silver, et al. Deep Reinforcement Learning with Double Q-Learning, 2015, AAAI.
[18] Matthieu Geist, et al. A Cascaded Supervised Learning Approach to Inverse Reinforcement Learning, 2013, ECML/PKDD.
[19] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.
[20] Philip Bachman, et al. Deep Reinforcement Learning that Matters, 2017, AAAI.
[21] R. Bellomo, et al. The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3), 2016, JAMA.
[22] Anind K. Dey, et al. Maximum Entropy Inverse Reinforcement Learning, 2008, AAAI.
[23] J. Andrew Bagnell, et al. Efficient Reductions for Imitation Learning, 2010, AISTATS.
[24] Stefano Ermon, et al. Generative Adversarial Imitation Learning, 2016, NIPS.
[25] J. Andrew Bagnell, et al. Maximum margin planning, 2006, ICML.
[26] Marc Toussaint, et al. Probabilistic inference for solving discrete and continuous state Markov Decision Processes, 2006, ICML.
[27] Samuel Gershman, et al. Deep Successor Reinforcement Learning, 2016, ArXiv.
[28] Michail G. Lagoudakis, et al. Least-Squares Policy Iteration, 2003, J. Mach. Learn. Res.