Manipulating Reinforcement Learning: Poisoning Attacks on Cost Signals

This chapter studies emerging cyber-attacks on reinforcement learning (RL) and introduces a quantitative approach to analyze the vulnerabilities of RL. Focusing on adversarial manipulation on the cost signals, we analyze the performance degradation of TD($\lambda$) and $Q$-learning algorithms under the manipulation. For TD($\lambda$), the approximation learned from the manipulated costs has an approximation error bound proportional to the magnitude of the attack. The effect of the adversarial attacks on the bound does not depend on the choice of $\lambda$. In $Q$-learning, we show that $Q$-learning algorithms converge under stealthy attacks and bounded falsifications on cost signals. We characterize the relation between the falsified cost and the $Q$-factors as well as the policy learned by the learning agent which provides fundamental limits for feasible offensive and defensive moves. We propose a robust region in terms of the cost within which the adversary can never achieve the targeted policy. We provide conditions on the falsified cost which can mislead the agent to learn an adversary's favored policy. A case study of TD($\lambda$) learning is provided to corroborate the results.

[1]  E. Cheney Analysis for Applied Mathematics , 2001 .

[2]  Quanyan Zhu,et al.  Dynamic policy-based IDS configuration , 2009, Proceedings of the 48h IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference.

[3]  Ah-Hwee Tan,et al.  Integrating Temporal Difference Methods and Self-Organizing Neural Networks for Reinforcement Learning With Delayed Evaluative Feedback , 2008, IEEE Transactions on Neural Networks.

[4]  Ben J. A. Kröse,et al.  Learning from delayed rewards , 1995, Robotics Auton. Syst..

[5]  Arslan Munir,et al.  Adversarial Reinforcement Learning Framework for Benchmarking Collision Avoidance Mechanisms in Autonomous Vehicles , 2018, IEEE Intelligent Transportation Systems Magazine.

[6]  D. Ernst,et al.  Power systems stability control: reinforcement learning framework , 2004, IEEE Transactions on Power Systems.

[7]  Quanyan Zhu,et al.  A cyber-physical game framework for secure and resilient multi-agent autonomous systems , 2015, 2015 54th IEEE Conference on Decision and Control (CDC).

[8]  Ming-Yu Liu,et al.  Tactics of Adversarial Attack on Deep Reinforcement Learning Agents , 2017, IJCAI.

[9]  Laurent Orseau,et al.  Reinforcement Learning with a Corrupted Reward Channel , 2017, IJCAI.

[10]  Richard S. Sutton,et al.  Introduction to Reinforcement Learning , 1998 .

[11]  Quanyan Zhu,et al.  Adaptive Honeypot Engagement through Reinforcement Learning of Semi-Markov Decision Processes , 2019, GameSec.

[12]  Xiaojin Zhu,et al.  Policy Poisoning in Batch Reinforcement Learning and Control , 2019, NeurIPS.

[13]  Eduardo F. Morales,et al.  An Introduction to Reinforcement Learning , 2011 .

[14]  Arslan Munir,et al.  The Faults in Our Pi Stars: Security Issues and Open Challenges in Deep Reinforcement Learning , 2018, ArXiv.

[15]  Sean P. Meyn,et al.  The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning , 2000, SIAM J. Control. Optim..

[16]  Jan Peters,et al.  Reinforcement learning in robotics: A survey , 2013, Int. J. Robotics Res..

[17]  E. Kreyszig Introductory Functional Analysis With Applications , 1978 .

[18]  T. Urbanik,et al.  Reinforcement learning-based multi-agent system for network traffic signal control , 2010 .

[19]  Chris Watkins,et al.  Learning from delayed rewards , 1989 .

[20]  Quanyan Zhu,et al.  Deceptive Reinforcement Learning Under Adversarial Manipulations on Cost Signals , 2019, GameSec.

[21]  Bo Li,et al.  Reinforcement Learning with Perturbed Rewards , 2018, AAAI.

[22]  Peter Dayan,et al.  Q-learning , 1992, Machine Learning.

[23]  Balsman,et al.  The Theorems of the Alternative , 1991 .

[24]  John N. Tsitsiklis,et al.  Analysis of Temporal-Diffference Learning with Function Approximation , 1996, NIPS.

[25]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[26]  V. Borkar Stochastic Approximation: A Dynamical Systems Viewpoint , 2008, Texts and Readings in Mathematics.