MEMORY-BASED PARAMETER ADAPTATION

Deep neural networks have excelled on a wide range of problems, from vision to language and game playing. Neural networks incorporate information into their weights only gradually as they process data, requiring very low learning rates. If the training distribution shifts, the network is slow to adapt, and when it does adapt, it typically performs badly on the pre-shift training distribution. Our method, Memory-based Parameter Adaptation, stores examples in memory and uses a context-based lookup to directly modify the weights of a neural network. Much higher learning rates can be used for this local adaptation, removing the need for many iterations over similar data before good predictions can be made. Because it is memory-based, our method alleviates several shortcomings of neural networks: it mitigates catastrophic forgetting, enables fast and stable acquisition of new knowledge, improves learning with imbalanced class labels, and supports fast learning during evaluation. We demonstrate this on a range of supervised tasks: large-scale image classification and language modelling.
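To make the local-adaptation step concrete, below is a minimal sketch, in PyTorch, of one way a context-based memory lookup can drive a few high-learning-rate gradient steps on a copy of the output weights before predicting. The names (EpisodicMemory, adapt_and_predict) and the specific choices of an inverse-distance kernel, SGD, and a quadratic penalty toward the unadapted weights are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of memory-based local adaptation, assuming PyTorch and a
# classifier split into an embedding network `embed_net` and an output
# network `output_net`. All names and hyper-parameters here are illustrative.
import copy

import torch
import torch.nn.functional as F


class EpisodicMemory:
    """Stores (key, value) pairs: key = context embedding, value = class label."""

    def __init__(self):
        self.keys, self.values = [], []

    def write(self, keys, values):
        # keys: (B, d) float embeddings; values: (B,) integer labels.
        self.keys.append(keys.detach())
        self.values.append(values.detach())

    def lookup(self, query, k):
        keys = torch.cat(self.keys)      # (N, d)
        values = torch.cat(self.values)  # (N,)
        dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)  # (N,)
        k = min(k, dists.numel())
        idx = dists.topk(k, largest=False).indices
        # Inverse-distance kernel: closer neighbours weigh more in adaptation.
        w = 1.0 / (dists[idx] + 1e-4)
        return keys[idx], values[idx], w / w.sum()


def adapt_and_predict(embed_net, output_net, memory, x,
                      k=32, adapt_lr=0.1, adapt_steps=5, reg=1.0):
    """Locally adapt a throwaway copy of the output network on retrieved
    neighbours of x, then predict with the adapted copy."""
    with torch.no_grad():
        query = embed_net(x.unsqueeze(0)).squeeze(0)
    keys, values, w = memory.lookup(query, k)

    adapted = copy.deepcopy(output_net)  # base weights remain untouched
    opt = torch.optim.SGD(adapted.parameters(), lr=adapt_lr)
    for _ in range(adapt_steps):
        opt.zero_grad()
        per_example = F.cross_entropy(adapted(keys), values, reduction="none")
        loss = (w * per_example).sum()
        # Quadratic penalty keeps the adapted weights close to the originals.
        for p, p0 in zip(adapted.parameters(), output_net.parameters()):
            loss = loss + reg * (p - p0.detach()).pow(2).sum()
        loss.backward()
        opt.step()

    with torch.no_grad():
        return adapted(query.unsqueeze(0))  # logits for the single query
```

In this sketch only a temporary copy of the output network is adapted per query, so the base weights, and hence performance on the original distribution, are left intact; that separation is what makes the aggressive local learning rate safe.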
