Learning robust policies when losing control

Many real-world applications require control strategies that are robust against a model of potential temporary external control, such as failures of the designed controller or malicious attacks. In this article we assume a Markovian model of control transitions and use it to extend Q-learning in a manner akin to the options framework, but addressing the risk of involuntarily losing control to possibly malicious ‘options’. The resulting reinforcement learning algorithm maximises expected return, and is model-free with respect to the domain dynamics but model-based with respect to control transitions. Our model allows us to exploit parallel off-policy updates to learn efficiently from experience. Results demonstrate that effective safe strategies can be learned from mistakes, possibly even before attacks occur. Our algorithm compares favourably to on-policy SARSA and Expected SARSA and to off-policy Q-learning in a multi-agent benchmark, can be trained using forward domain models, and is compatible with many state-of-the-art extensions, such as deep learning, Retrace(λ), or Q(σ). We thus pave the way to learning robust strategies in critical multi-agent domains, such as smart grids, where graceful degradation is a prerequisite.
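As a rough illustration of the idea described above (not the paper’s exact algorithm), the following Python sketch shows tabular Q-learning backups that are model-free in the domain dynamics but use an assumed control-transition model; the probabilities P_LOSE and P_REGAIN, the two value tables, and the worst-case treatment of the external controller are all hypothetical simplifications.

```python
import numpy as np

# Illustrative sketch only (names and structure are assumptions, not the
# paper's exact algorithm): tabular Q-learning with a known model of
# control transitions.  The agent may lose control to a worst-case
# external controller with probability P_LOSE, and regain it with
# probability P_REGAIN.  Two value tables are updated off-policy, in
# parallel, from every observed transition (s, a, r, s_next).

P_LOSE, P_REGAIN = 0.1, 0.5        # assumed control-transition model
ALPHA, GAMMA = 0.1, 0.95           # learning rate, discount factor

n_states, n_actions = 25, 4
Q_ctrl = np.zeros((n_states, n_actions))  # value while agent is in control
Q_lost = np.zeros((n_states, n_actions))  # value while control is lost

def next_value_in_control(s):
    # Agent keeps control (greedy) or loses it (worst case), weighted by
    # the assumed control-transition probabilities.
    return (1 - P_LOSE) * Q_ctrl[s].max() + P_LOSE * Q_lost[s].min()

def next_value_lost(s):
    # Control stays lost (worst case) or is regained (greedy).
    return (1 - P_REGAIN) * Q_lost[s].min() + P_REGAIN * Q_ctrl[s].max()

def parallel_update(s, a, r, s_next):
    # Both tables are updated from the same experience, off-policy with
    # respect to whoever actually chose the action.
    target_ctrl = r + GAMMA * next_value_in_control(s_next)
    target_lost = r + GAMMA * next_value_lost(s_next)
    Q_ctrl[s, a] += ALPHA * (target_ctrl - Q_ctrl[s, a])
    Q_lost[s, a] += ALPHA * (target_lost - Q_lost[s, a])
```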
