Learning the Arrow of Time for Problems in Reinforcement Learning

We humans have an innate understanding of the asymmetric progression of time, which we use to perceive and manipulate our environment efficiently and safely. Drawing inspiration from this, we approach the problem of learning an arrow of time in a Markov (Decision) Process. We illustrate how a learned arrow of time can capture salient information about the environment, which in turn can be used to measure reachability, detect side effects, and obtain an intrinsic reward signal. Finally, we propose a simple yet effective algorithm to parameterize the problem at hand and learn an arrow of time with a function approximator (here, a deep neural network). Our empirical results span a selection of discrete and continuous environments, and demonstrate for a class of stochastic processes that the learned arrow of time agrees reasonably well with a well-known notion of an arrow of time due to Jordan, Kinderlehrer, and Otto (1998).
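As a rough illustration of what the abstract describes (learning an arrow of time with a function approximator), the sketch below trains a scalar potential h(s) so that it tends to increase along forward trajectories of a stochastic process. It is a minimal sketch, not the authors' reference implementation: the toy biased random walk, the MLP architecture, and the squared-increment regularizer with weight `lam` are all assumptions made for this example.

```python
# Minimal sketch: learn a scalar "arrow of time" potential h(s) that
# increases along forward trajectories of a toy stochastic process.
# Environment, network size, and regularizer are illustrative assumptions.
import torch
import torch.nn as nn

N_STATES = 32  # toy birth-death chain on {0, ..., 31}
P_UP = 0.7     # probability of stepping up; the bias gives time a direction

def rollout(batch_size, length=16):
    """Simulate batched random walks; return (s_t, s_{t+1}) index pairs."""
    s = torch.randint(0, N_STATES, (batch_size,))
    starts, nexts = [], []
    for _ in range(length):
        step = (torch.rand(batch_size) < P_UP).long() * 2 - 1
        s_next = (s + step).clamp(0, N_STATES - 1)  # reflecting boundaries
        starts.append(s)
        nexts.append(s_next)
        s = s_next
    return torch.cat(starts), torch.cat(nexts)

def one_hot(idx):
    return torch.eye(N_STATES)[idx]

# h(s): a small MLP mapping a one-hot state encoding to a scalar potential.
h = nn.Sequential(nn.Linear(N_STATES, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(h.parameters(), lr=1e-3)
lam = 0.1  # regularization weight (assumed); keeps the objective bounded

for it in range(2000):
    s_t, s_t1 = rollout(batch_size=64)
    dh = (h(one_hot(s_t1)) - h(one_hot(s_t))).squeeze(-1)
    # Maximize the expected increase of h along observed transitions,
    # penalizing the squared increment so h cannot grow without bound.
    loss = -dh.mean() + lam * dh.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, h should be roughly monotonic in the state index,
# pointing in the direction of the chain's drift.
```

With the regularizer removed, the objective is unbounded and h diverges; any bounded surrogate for the expected increase of h would serve the same purpose in this sketch.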

[1] Marc Pollefeys, et al. Episodic Curiosity through Reachability, 2018, ICLR.

[2] D. Kinderlehrer, et al. The Variational Formulation of the Fokker-Planck Equation, 1996.

[3] Alexei A. Efros, et al. Large-Scale Study of Curiosity-Driven Learning, 2018, ICLR.

[4] Tom Schaul, et al. Dueling Network Architectures for Deep Reinforcement Learning, 2015, ICML.

[5] Bernhard Schölkopf, et al. Detecting the Direction of Causal Time Series, 2009, ICML.

[6] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[7] Aram Galstyan, et al. Efficient Estimation of Mutual Information for Strongly Dependent Variables, 2014, AISTATS.

[8] Nuttapong Chentanez, et al. Intrinsically Motivated Reinforcement Learning, 2004, NIPS.

[9] U. Seifert. Stochastic Thermodynamics, Fluctuation Theorems and Molecular Machines, 2012, Reports on Progress in Physics.

[10] Marc G. Bellemare, et al. Safe and Efficient Off-Policy Reinforcement Learning, 2016, NIPS.

[11] Bernhard Schölkopf, et al. Seeing the Arrow of Time, 2014, CVPR.

[12] Laurent Orseau, et al. Measuring and Avoiding Side Effects Using Relative Reachability, 2018, arXiv.

[13] Alexei A. Efros, et al. Curiosity-Driven Exploration by Self-Supervised Prediction, 2017, CVPR Workshops.

[14] A. S. Eddington. The Nature of the Physical World, 1928.

[15] B. Schölkopf, et al. Algorithmic Independence of Initial Condition and Dynamical Law in Thermodynamics and Causal Inference, 2015, arXiv:1512.02057.

[16] Andrew Zisserman, et al. Learning and Using the Arrow of Time, 2018, CVPR.

[17] A. Kraskov, et al. Estimating Mutual Information, 2003, Physical Review E.

[18] Stefan Bauer, et al. The Arrow of Time in Multivariate Time Series, 2016, ICML.

[19] A. Savitzky, et al. Smoothing and Differentiation of Data by Simplified Least Squares Procedures, 1964, Analytical Chemistry.

[20] Yoshua Bengio, et al. Unsupervised State Representation Learning in Atari, 2019, NeurIPS.

[21] J. Willems. Dissipative Dynamical Systems Part I: General Theory, 1972, Archive for Rational Mechanics and Analysis.

[22] Ankush Gupta, et al. Unsupervised Learning of Object Keypoints for Perception and Control, 2019, NeurIPS.

[23] W. H. Zurek. Algorithmic Randomness and Physical Entropy, 1989, Physical Review A.

[24] Pieter Abbeel, et al. Safe Exploration in Markov Decision Processes, 2012, ICML.

[25] Bernhard Schölkopf, et al. Elements of Causal Inference: Foundations and Learning Algorithms, 2017, MIT Press.

[26] John Schulman, et al. Concrete Problems in AI Safety, 2016, arXiv.

[27] Dominik Janzing, et al. On the Entropy Production of Time Series with Unidirectional Linearity, 2009, arXiv:0908.1861.

[28] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, 1998, MIT Press.

[29] Sergey Levine, et al. Leave No Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning, 2017, ICLR.

[30] Stuart Armstrong, et al. Low Impact Artificial Intelligences, 2017, arXiv.

[31] Andrew Y. Ng, et al. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping, 1999, ICML.

[32] Jürgen Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010), 2010, IEEE Transactions on Autonomous Mental Development.

[33] Shane Legg, et al. Noisy Networks for Exploration, 2017, ICLR.