Accelerated Policy Evaluation: Learning Adversarial Environments with Adaptive Importance Sampling

The evaluation of rare but high-stakes events is one of the main difficulties in obtaining reliable policies from intelligent agents, especially in large or continuous state/action spaces, where limited scalability forces a prohibitively large number of testing iterations. At the same time, a biased or inaccurate policy evaluation in a safety-critical system can cause unexpected catastrophic failures during deployment. In this paper, we propose the Accelerated Policy Evaluation (APE) method, which simultaneously uncovers rare events and estimates the rare-event probability in Markov decision processes. APE treats the environment's nature as an adversarial agent and, through adaptive importance sampling, learns toward the zero-variance sampling distribution for policy evaluation. Moreover, APE scales to large discrete or continuous spaces by incorporating function approximators. We investigate the convergence properties of the proposed algorithms under suitable regularity conditions. Our empirical studies show that APE estimates rare-event probabilities with smaller variance while using orders of magnitude fewer samples than baseline methods in both multi-agent and single-agent environments.
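To make the adaptive importance sampling idea concrete, below is a minimal, self-contained sketch on a toy one-dimensional problem: estimating P(X > 4) for X ~ N(0, 1) by iteratively shifting a Gaussian proposal toward the zero-variance distribution q*(x) ∝ 1{x > t} p(x). This illustrates only the general technique, not the APE algorithm itself; the names (`threshold`, `mu`) and the cross-entropy-style update rule are assumptions made for the example.

```python
import numpy as np
from scipy.stats import norm

# Toy rare event: p = P(X > 4) for X ~ N(0, 1), true value ~ 3.17e-5.
# Crude Monte Carlo would need on the order of 10^7 samples to see any
# failures; adaptive importance sampling instead shifts the proposal
# toward the failure region and reweights.

rng = np.random.default_rng(0)
threshold = 4.0
n_per_iter = 2_000
mu = 0.0  # proposal mean, adapted over iterations

for it in range(10):
    x = rng.normal(mu, 1.0, n_per_iter)                 # sample from proposal q = N(mu, 1)
    w = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, mu, 1.0)    # importance weights p(x)/q(x)
    hit = (x > threshold).astype(float)
    if hit.sum() == 0:
        # No failures observed yet: move toward the largest samples (CE-style elite set).
        elite = np.sort(x)[-int(0.05 * n_per_iter):]
        mu = elite.mean()
    else:
        # Weighted update toward the zero-variance proposal q*(x) ∝ 1{x > t} p(x).
        mu = np.sum(w * hit * x) / np.sum(w * hit)

# Final unbiased estimate with the adapted proposal.
x = rng.normal(mu, 1.0, n_per_iter)
w = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, mu, 1.0)
p_hat = np.mean(w * (x > threshold))
print(f"adapted mu = {mu:.3f}, p_hat = {p_hat:.3e}, true = {1 - norm.cdf(4.0):.3e}")
```

Once the proposal concentrates on the failure region, the importance-weighted estimator remains unbiased while its variance drops sharply; this is the property that learning toward the zero-variance sampling distribution exploits in the MDP setting.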
