Continual Learning from the Perspective of Compression

Connectionist models such as neural networks suffer from catastrophic forgetting. In this work, we study this problem from the perspective of information theory and define forgetting as the increase in the description length of previous data when it is compressed with a sequentially learned model. In addition, we show that continual learning approaches based on variational posterior approximation and generative replay can be viewed as approximations to two prequential coding methods in compression, namely the Bayesian mixture code and the maximum likelihood (ML) plug-in code. We compare these approaches in terms of both compression and forgetting, and empirically study the reasons that limit the performance of continual learning methods based on variational posterior approximation. To address these limitations, we propose a new continual learning method that combines the ML plug-in and Bayesian mixture codes.
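
The following is a minimal sketch, not the paper's implementation, of the two ideas described above on a toy categorical data stream: prequential coding with a Bayesian mixture code (a Dirichlet mixture, i.e. the Krichevsky-Trofimov estimator) versus an ML plug-in code, and forgetting measured as the increase in description length (in bits) of task-1 data once the model has also been updated on task-2 data. The two-task stream, the Dirichlet parameter `alpha`, and the smoothing constant `eps` are illustrative assumptions, not values taken from the paper.

```python
# Toy sketch (assumed, not the authors' code): prequential coding of a symbol
# stream with a Bayesian mixture code vs. an ML plug-in code, and "forgetting"
# as the increase in description length of earlier data under a later model.
import numpy as np

rng = np.random.default_rng(0)
K = 2  # alphabet size

# Two "tasks" with different symbol distributions (toy non-stationary stream).
task1 = rng.choice(K, size=500, p=[0.9, 0.1])
task2 = rng.choice(K, size=500, p=[0.1, 0.9])
stream = np.concatenate([task1, task2])

def bayes_mixture_prob(counts, x, alpha=0.5):
    """Posterior predictive of a Dirichlet(alpha) mixture (KT estimator)."""
    return (counts[x] + alpha) / (counts.sum() + alpha * len(counts))

def ml_plugin_prob(counts, x, eps=1e-3):
    """ML estimate from past counts, lightly smoothed to avoid log(0)."""
    return (counts[x] + eps) / (counts.sum() + eps * len(counts))

def prequential_codelength(stream, prob_fn):
    """Encode each symbol with the model fit to all *previous* symbols."""
    counts = np.zeros(K)
    bits = 0.0
    for x in stream:
        bits += -np.log2(prob_fn(counts, x))
        counts[x] += 1
    return bits

print("Bayes mixture code length (bits):", prequential_codelength(stream, bayes_mixture_prob))
print("ML plug-in code length (bits)   :", prequential_codelength(stream, ml_plugin_prob))

# Forgetting: description length of task-1 data under the model available
# right after task 1, versus under the model after also seeing task 2.
# (A real sequentially trained neural model typically shows a larger increase;
# here the distribution shift alone already makes the difference positive.)
def codelength_under(counts, data, prob_fn):
    return sum(-np.log2(prob_fn(counts, x)) for x in data)

counts_after_t1 = np.bincount(task1, minlength=K).astype(float)
counts_after_t2 = counts_after_t1 + np.bincount(task2, minlength=K)

before = codelength_under(counts_after_t1, task1, ml_plugin_prob)
after = codelength_under(counts_after_t2, task1, ml_plugin_prob)
print("Forgetting on task 1 (bits):", after - before)
```
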
