Robust temporal difference learning for critical domains

We present a new Q-function operator for temporal difference (TD) learning methods that explicitly encodes robustness against significant rare events (SREs) in critical domains. The operator, which we call the $\kappa$-operator, allows a robust policy to be learned in a model-based fashion without actually observing any SRE. We introduce single- and multi-agent robust TD methods based on the $\kappa$-operator. Using the theory of Generalized Markov Decision Processes, we prove that the operator converges to the optimal robust Q-function with respect to the model. We further prove convergence to the optimal Q-function of the original MDP when the probability of SREs vanishes. Empirical evaluations demonstrate the superior performance of $\kappa$-based TD methods both in the early learning phase and in the final converged stage. We also show that the proposed method is robust to small model errors and applicable in a multi-agent context.
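The abstract does not spell out the operator itself. As a purely illustrative sketch (not the paper's definition), one can picture a tabular Q-learning step whose target blends the usual greedy backup with a pessimistic backup at a model-predicted SRE successor state, weighted by a parameter $\kappa$; the state index `s_sre`, the min-backup, and all names and hyperparameters below are assumptions for illustration only.

```python
import numpy as np

def kappa_q_update(Q, s, a, r, s_next, s_sre, kappa=0.1, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step with an assumed kappa-robust target.

    Q       : array of shape (n_states, n_actions)
    s, a    : current state and action indices
    r       : observed reward
    s_next  : observed successor state index
    s_sre   : successor state a model predicts if the SRE occurred (assumption)
    kappa   : weight placed on the worst-case (SRE) backup (assumption)
    """
    greedy_target = np.max(Q[s_next])       # standard Q-learning backup
    worst_case_target = np.min(Q[s_sre])    # pessimistic backup at the modelled SRE state
    target = r + gamma * ((1.0 - kappa) * greedy_target + kappa * worst_case_target)
    Q[s, a] += alpha * (target - Q[s, a])   # TD update toward the blended target
    return Q
```

The intent of the sketch is only to convey the abstract's claim that robustness is injected through the backup target itself, so the rare event never has to be observed in the training data.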
