From eye-blinks to state construction: Diagnostic benchmarks for online representation learning

Experiments in classical conditioning show that animals such as rabbits, pigeons, and dogs can make long temporal associations that enable multi-step prediction. To replicate this remarkable ability, an agent must construct an internal state representation that summarizes its interaction history. Recurrent neural networks can automatically construct state and learn temporal associations, but current training methods are prohibitively expensive for online prediction -- continual learning on every time step -- which is the focus of this paper. To facilitate research in online prediction, we present three new diagnostic prediction problems inspired by classical-conditioning experiments. The proposed problems test learning capabilities that animals readily exhibit and highlight the limitations of current recurrent learning methods. While the proposed problems are nontrivial, they remain amenable to extensive testing and analysis in the small-compute regime, enabling researchers to carefully study issues in isolation and ultimately accelerating progress towards scalable online representation-learning methods.
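To make the online-prediction setting concrete, below is a minimal sketch of a trace-conditioning-style prediction stream paired with an online TD(0) learner on a fixed tap-delay representation. This is an illustration only, not the paper's benchmark specification: the inter-stimulus interval, inter-trial interval, discount, and step size are assumed values, and the function and variable names are hypothetical.

```python
import numpy as np

# Hypothetical sketch of a trace-conditioning prediction stream: a one-step
# conditioned stimulus (CS) is followed by an unconditioned stimulus (US)
# after a fixed inter-stimulus interval (ISI). All parameters are assumptions
# chosen for illustration, not values from the paper.

rng = np.random.default_rng(0)

ISI = 10                 # steps between CS onset and US
ITI_RANGE = (20, 40)     # random inter-trial interval
GAMMA = 0.9              # discount for the prediction target
ALPHA = 0.1              # step size for the online update

def conditioning_stream(n_steps):
    """Yield (cs, us) observation bits, one pair per time step."""
    t_cs = rng.integers(*ITI_RANGE)          # time of the next CS onset
    for t in range(n_steps):
        cs = 1.0 if t == t_cs else 0.0
        us = 1.0 if t == t_cs + ISI else 0.0
        if us:                               # schedule the next trial
            t_cs = t + rng.integers(*ITI_RANGE)
        yield cs, us

# Minimal online TD(0) learner on a tap-delay ("presence") representation of
# the CS: the state is a buffer of the last ISI+1 CS values, so the long
# temporal association is representable without learned state construction.
n_features = ISI + 1
w = np.zeros(n_features)
buffer = np.zeros(n_features)

prev_features = buffer.copy()
for cs, us in conditioning_stream(10_000):
    buffer = np.roll(buffer, 1)
    buffer[0] = cs
    # TD(0) update toward the discounted sum of future US signals,
    # applied on every time step (the online, continual setting).
    td_error = us + GAMMA * (w @ buffer) - (w @ prev_features)
    w += ALPHA * td_error * prev_features
    prev_features = buffer.copy()

print("learned weights (oldest CS lag last):", np.round(w, 2))
```

The hand-built tap-delay buffer sidesteps the hard part: it supplies a sufficient state for free. The benchmarks described in the abstract are aimed at learners, such as recurrent networks trained online, that must construct an equivalent state representation from the raw CS/US stream themselves.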
