Learning Retrospective Knowledge with Reverse Reinforcement Learning

We present a Reverse Reinforcement Learning (Reverse RL) approach for representing retrospective knowledge. General Value Functions (GVFs) have enjoyed great success in representing predictive knowledge, i.e., answering questions about possible future outcomes such as "how much fuel will be consumed in expectation if we drive from A to B?". GVFs, however, cannot answer questions like "how much fuel do we expect a car to have given it is at B at time $t$?". To answer this question, we need to know when that car had a full tank and how that car came to B. Since such questions emphasize the influence of possible past events on the present, we refer to their answers as retrospective knowledge. In this paper, we show how to represent retrospective knowledge with Reverse GVFs, which are trained via Reverse RL. We demonstrate empirically the utility of Reverse GVFs in both representation learning and anomaly detection.
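
To make the forward/backward distinction concrete, here is a minimal sketch in the abstract's notation; the cumulant $C$, discount $\gamma$, step size $\alpha$, and the reverse-TD-style update below are illustrative assumptions, not the paper's exact formulation. A GVF accumulates a cumulant over what happens after the current state, while a Reverse GVF accumulates it over the trajectory that led to the current state:

$$v(s) \doteq \mathbb{E}_{\pi}\Big[\textstyle\sum_{k=0}^{\infty} \gamma^{k} C_{t+k+1} \,\Big|\, S_t = s\Big], \qquad \tilde{v}(s) \doteq \mathbb{E}\Big[\textstyle\sum_{k=0}^{\infty} \gamma^{k} C_{t-k} \,\Big|\, S_t = s\Big].$$

Under these assumptions, a reverse-TD-style learning rule would bootstrap from the predecessor state $S_{t-1}$ rather than the successor state:

$$\tilde{v}(S_t) \leftarrow \tilde{v}(S_t) + \alpha\,\big(C_t + \gamma\,\tilde{v}(S_{t-1}) - \tilde{v}(S_t)\big).$$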
