Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference

Poor performance in continual learning over non-stationary data distributions remains a major challenge in scaling neural network learning to more realistic, human-like settings. In this work we propose a new conceptualization of the continual learning problem in terms of a temporally symmetric trade-off between transfer and interference that can be optimized by enforcing gradient alignment across examples. We then propose a new algorithm, Meta-Experience Replay (MER), that directly exploits this view by combining experience replay with optimization-based meta-learning. The method learns parameters that make interference from future gradients less likely and transfer from future gradients more likely. We conduct experiments on continual lifelong supervised learning benchmarks and non-stationary reinforcement learning environments, demonstrating that our approach consistently outperforms recently proposed baselines for continual learning. Our experiments also show that the performance gap between MER and the baselines grows both as the environment becomes more non-stationary and as the fraction of total experiences stored becomes smaller.
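One way to make the gradient-alignment idea concrete (a hedged reading of the abstract, not necessarily the paper's exact formulation) is an objective over pairs of examples that penalizes their losses while rewarding the dot product of their gradients, so that a step taken on one example tends to help rather than hurt the other:

\[
\theta^{*} \;=\; \arg\min_{\theta}\;
\mathbb{E}_{(x_i, y_i),\,(x_j, y_j)}
\left[
L(x_i, y_i) + L(x_j, y_j)
\;-\;
\alpha \,
\frac{\partial L(x_i, y_i)}{\partial \theta}
\cdot
\frac{\partial L(x_j, y_j)}{\partial \theta}
\right]
\]

A positive gradient dot product corresponds to transfer (reducing loss on one example also reduces it on the other), a negative dot product corresponds to interference, and \(\alpha\) controls how strongly alignment is traded off against the raw losses.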

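The sketch below illustrates the flavor of the algorithm under stated assumptions: a replay buffer maintained by reservoir sampling, an inner loop of plain SGD steps on interleaved old and new examples, and a Reptile-style meta-update that moves the parameters only part of the way toward the post-inner-loop solution, which implicitly favors aligned gradients. The class name, the single meta-rate beta, the hard-coded classification loss, and all hyperparameter values are illustrative assumptions, not the authors' reference implementation.

```python
import random
import torch
import torch.nn as nn


class MERLearner:
    """Minimal, hedged sketch of the Meta-Experience Replay idea:
    experience replay (reservoir sampling) combined with a Reptile-style
    meta-update over interleaved old and new examples."""

    def __init__(self, model, lr=0.01, beta=0.3, buffer_capacity=5000, batch_size=10):
        self.model = model
        self.lr = lr                    # inner-loop SGD learning rate (illustrative)
        self.beta = beta                # meta-update interpolation rate (illustrative)
        self.capacity = buffer_capacity
        self.batch_size = batch_size
        self.buffer = []                # replay memory of (x, y) pairs
        self.seen = 0                   # stream length so far, for reservoir sampling
        self.loss_fn = nn.CrossEntropyLoss()  # assumes a classification setting

    def step(self, x, y):
        # Snapshot parameters before the inner loop.
        theta_before = [p.detach().clone() for p in self.model.parameters()]

        # Interleave replayed examples with the incoming one.
        replay = random.sample(self.buffer, min(len(self.buffer), self.batch_size - 1))
        batch = replay + [(x, y)]

        # Inner loop: one plain SGD step per example.
        for xi, yi in batch:
            loss = self.loss_fn(self.model(xi), yi)
            self.model.zero_grad()
            loss.backward()
            with torch.no_grad():
                for p in self.model.parameters():
                    if p.grad is not None:
                        p -= self.lr * p.grad

        # Reptile-style meta-update: move only part of the way from the old
        # parameters toward the post-inner-loop parameters.
        with torch.no_grad():
            for p, p0 in zip(self.model.parameters(), theta_before):
                p.copy_(p0 + self.beta * (p - p0))

        # Reservoir sampling keeps an approximately uniform sample of the stream.
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append((x, y))
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = (x, y)
```

As a usage sketch, one could wrap a small classifier, e.g. `MERLearner(nn.Sequential(nn.Linear(784, 100), nn.ReLU(), nn.Linear(100, 10)))`, and call `step(x, y)` on each example of the non-stationary stream; the interpolation toward the inner-loop result is what distinguishes this from ordinary experience replay.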