Robust temporal difference learning for critical domains

We present a new Q-function operator for temporal difference (TD) learning methods that explicitly encodes robustness against significant rare events (SREs) in critical domains. The operator, which we call the $\kappa$-operator, allows a robust policy to be learned in a model-based fashion without actually observing any SRE. We introduce single- and multi-agent robust TD methods based on the $\kappa$-operator. Using the theory of Generalized Markov Decision Processes, we prove that the operator converges to the optimal robust Q-function with respect to the model. We further prove convergence to the optimal Q-function of the original MDP when the probability of SREs vanishes. Empirical evaluations demonstrate the superior performance of $\kappa$-based TD methods both in the early learning phase and in the final converged stage. We also show that the proposed method is robust to small model errors and applicable in a multi-agent context.
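The abstract does not spell out the operator itself. As a purely illustrative sketch (not the paper's definition), one can picture a tabular Q-learning step whose target blends the usual greedy backup with a pessimistic backup at a model-predicted SRE successor state, weighted by a parameter $\kappa$; the state index `s_sre`, the min-backup, and all names and hyperparameters below are assumptions for illustration only.

```python
import numpy as np

def kappa_q_update(Q, s, a, r, s_next, s_sre, kappa=0.1, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step with an assumed kappa-robust target.

    Q       : array of shape (n_states, n_actions)
    s, a    : current state and action indices
    r       : observed reward
    s_next  : observed successor state index
    s_sre   : successor state a model predicts if the SRE occurred (assumption)
    kappa   : weight placed on the worst-case (SRE) backup (assumption)
    """
    greedy_target = np.max(Q[s_next])       # standard Q-learning backup
    worst_case_target = np.min(Q[s_sre])    # pessimistic backup at the modelled SRE state
    target = r + gamma * ((1.0 - kappa) * greedy_target + kappa * worst_case_target)
    Q[s, a] += alpha * (target - Q[s, a])   # TD update toward the blended target
    return Q
```

The intent of the sketch is only to convey the abstract's claim that robustness is injected through the backup target itself, so the rare event never has to be observed in the training data.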
