Memory Bounds for Continual Learning

Continual learning, or lifelong learning, is a formidable challenge in machine learning. It requires the learner to solve a sequence of k different learning tasks, one after the other, while retaining its aptitude for earlier tasks; the continual learner should scale better than the obvious solution of developing and maintaining a separate learner for each of the k tasks. We embark on a complexity-theoretic study of continual learning in the PAC framework. We make novel uses of communication complexity to establish that any continual learner, even an improper one, needs memory that grows linearly with k, strongly suggesting that the problem is intractable. When logarithmically many passes over the learning tasks are allowed, we provide an algorithm based on multiplicative weights update whose memory requirement scales well; we also establish that improper learning is necessary for such performance. We conjecture that these results may lead to promising new approaches to continual learning.
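
The positive result above builds on the multiplicative weights update (MWU) method of Arora, Hazan, and Kale. For orientation only, the following is a minimal, self-contained sketch of the generic MWU meta-algorithm over a finite pool of experts. It is not the paper's continual learner; the loss matrix, pool size, and learning rate eta below are hypothetical placeholders chosen for illustration.

```python
# Minimal sketch of the multiplicative weights update (MWU) meta-algorithm.
# Illustrative toy only, NOT the paper's algorithm: the expert pool, the
# loss oracle, and the learning rate eta are hypothetical stand-ins.

import numpy as np

def mwu(losses, eta=0.1):
    """Run MWU over a pool of n experts for T rounds.

    losses: array of shape (T, n); losses[t, i] in [0, 1] is the loss of
            expert i at round t (e.g., its error on the t-th task's sample).
    Returns the final distribution over experts and the algorithm's
    expected loss at each round.
    """
    T, n = losses.shape
    w = np.ones(n)                      # start with uniform weights
    per_round_loss = []
    for t in range(T):
        p = w / w.sum()                 # current distribution over experts
        per_round_loss.append(p @ losses[t])
        w *= np.exp(-eta * losses[t])   # penalize experts that did poorly
    return w / w.sum(), np.array(per_round_loss)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, n = 200, 16
    losses = rng.random((T, n))
    losses[:, 3] *= 0.2                 # expert 3 is consistently better
    p, alg_loss = mwu(losses)
    print("final weight on best expert:", round(p[3], 3))
    print("algorithm avg loss:", round(alg_loss.mean(), 3),
          "| best expert avg loss:", round(losses[:, 3].mean(), 3))
```

With eta set on the order of sqrt(ln n / T), the total expected loss of the maintained distribution exceeds that of the best single expert by only O(sqrt(T ln n)), which is the guarantee such reductions typically rely on.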
