Detecting Spiky Corruption in Markov Decision Processes

Current reinforcement learning methods fail if the reward function is imperfect, i.e., if the agent observes a reward different from the one it actually receives. We study this problem within the formalism of Corrupt Reward Markov Decision Processes (CRMDPs). We show that if the reward corruption in a CRMDP is sufficiently "spiky", the environment is solvable. We fully characterize the regret bound of a Spiky CRMDP and introduce an algorithm that detects its corrupt states. We show that this algorithm can be combined with any common reinforcement learning algorithm to learn the optimal policy. Finally, we evaluate the algorithm in a pair of simple gridworld environments, finding that it detects the corrupt states and learns the optimal policy despite the corruption.
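To give a rough intuition for what detecting "spiky" corruption could look like in a gridworld, the sketch below flags states whose observed reward stands out sharply from that of neighbouring states. It is not the algorithm from this paper, which the abstract does not specify. The neighbour-median rule, the `threshold` parameter, and the names `detect_spiky_states` and `grid_neighbors` are all assumptions made for illustration.

```python
from statistics import median


def detect_spiky_states(observed_reward, neighbors, threshold):
    """Flag states whose observed reward differs from the median reward
    of their neighbours by more than `threshold`.

    Illustrative heuristic only; the rule and all names are assumptions,
    not the method described in the paper.
    """
    flagged = set()
    for state, reward in observed_reward.items():
        # Collect rewards of neighbouring states that exist in the grid.
        nearby = [observed_reward[n] for n in neighbors(state)
                  if n in observed_reward]
        if not nearby:
            continue
        # A large gap to the local median is treated as a "spike".
        if abs(reward - median(nearby)) > threshold:
            flagged.add(state)
    return flagged


def grid_neighbors(state):
    """Four-connected neighbours of a (row, column) cell (hypothetical helper)."""
    r, c = state
    return [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]


# A 3x3 grid with a smoothly varying reward and one spike at the centre.
grid = {(r, c): 0.1 * (r + c) for r in range(3) for c in range(3)}
grid[(1, 1)] = 5.0

flagged = detect_spiky_states(grid, grid_neighbors, threshold=1.0)
```

In this toy grid only the centre cell is flagged. A learner could then ignore or down-weight the observed reward in flagged states before running a standard reinforcement learning algorithm, which is one possible reading of how detection and policy learning fit together.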
