Exact learning dynamics of deep linear networks with prior knowledge

Learning in deep neural networks is known to depend critically on the knowledge embedded in the initial network weights. However, few theoretical results have precisely linked prior knowledge to learning dynamics. Here we derive exact solutions to the dynamics of learning with rich prior knowledge in deep linear networks by generalising Fukumizu’s matrix Riccati solution [57]. We obtain explicit expressions for the evolving network function, hidden representational similarity, and neural tangent kernel over training for a broad class of initialisations and tasks. The expressions reveal a class of task-independent initialisations that radically alter learning dynamics, from slow non-linear trajectories to fast exponential ones, while converging to a global optimum with identical representational similarity, dissociating learning trajectories from the structure of initial internal representations. We characterise how network weights dynamically align with task structure, rigorously justifying why previous solutions successfully described learning from small initial weights without incorporating their fine-scale structure. Finally, we discuss the implications of these findings for continual learning, reversal learning and learning of structured knowledge. Taken together, our results provide a mathematical toolkit for understanding the impact of prior knowledge on deep learning.
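As a concrete illustration of the setting analysed here, the following sketch (not the paper's code) trains a two-layer deep linear network on a random linear teacher with full-batch gradient descent and prints the loss trajectory, the distance of the composite map from the teacher, and the hidden representational similarity matrix. The task, layer sizes, learning rate and initialisation scale are illustrative choices. Starting from small random weights, the loss shows the plateau-then-drop behaviour whose exact time course is the subject of the closed-form solutions described above.

```python
# Minimal sketch (not the paper's code): full-batch gradient descent in a
# two-layer deep linear network f(x) = W2 W1 x trained on a random linear
# teacher. Layer sizes, learning rate, number of steps, and the small
# initialisation scale are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hidden, d_out = 8, 16, 5
n_samples = 200

# Gaussian inputs and a noiseless linear teacher define the task.
X = rng.standard_normal((n_samples, d_in))
W_star = rng.standard_normal((d_out, d_in))
Y = X @ W_star.T

# Small, task-independent random initialisation for both layers.
scale = 1e-3
W1 = scale * rng.standard_normal((d_hidden, d_in))
W2 = scale * rng.standard_normal((d_out, d_hidden))

lr, steps = 0.05, 4000
losses = []
for t in range(steps):
    H = X @ W1.T                  # hidden-layer activations
    err = H @ W2.T - Y            # residual of the composite map
    losses.append(0.5 * np.mean(np.sum(err**2, axis=1)))

    # Gradient of the mean-squared error with respect to each layer.
    grad_W2 = err.T @ H / n_samples
    grad_W1 = W2.T @ err.T @ X / n_samples
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

# Loss trajectory: near-flat plateau followed by a rapid drop, the
# signature of learning from small random initial weights.
for t in [0, 25, 50, 100, 200, 400, 800, 1600, steps - 1]:
    print(f"step {t:4d}   loss {losses[t]:.3e}")
print("||W2 W1 - W*||_F =", np.linalg.norm(W2 @ W1 - W_star))

# Hidden representational similarity across training items (Gram matrix of
# hidden activations), one of the quantities tracked analytically in the paper.
rsm = (X @ W1.T) @ (X @ W1.T).T
print("RSM shape:", rsm.shape)
```

Varying `scale` (with a correspondingly smaller `lr` for large initial weights) gives a quick empirical feel for how the initialisation reshapes the trajectory, which is the regime the exact solutions characterise.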

[1] Andrew M. Saxe et al. Maslow's Hammer for Catastrophic Forgetting: Node Re-Use vs Node Activation, 2022, arXiv.

[2] Andrew M. Saxe et al. Orthogonal representations for robust context-dependent task performance in brains and neural networks, 2022, Neuron.

[3] C. Pehlevan et al. Neural Networks as Kernel Learners: The Silent Alignment Effect, 2021, ICLR.

[4] Andrew M. Saxe et al. Continual Learning in the Teacher-Student Setup: Impact of Task Similarity, 2021, ICML.

[5] Amir Globerson et al. A Theoretical Analysis of Fine-tuning with Linear Teachers, 2021, NeurIPS.

[6] Andrew M. Saxe et al. Probing transfer learning with a model of synthetic correlated datasets, 2021, Mach. Learn. Sci. Technol.

[7] Masato Okada et al. Statistical Mechanical Analysis of Catastrophic Forgetting in Continual Learning with Teacher and Student Networks, 2021, Journal of the Physical Society of Japan.

[8] Pierre Alquier et al. A Theoretical Analysis of Catastrophic Forgetting through the NTK Overlap Matrix, 2020, AISTATS.

[9] Dongsung Huh et al. Curvature-corrected learning dynamics in deep neural networks, 2020, ICML.

[10] Michael I. Jordan et al. On the Theory of Transfer Learning: The Importance of Task Diversity, 2020, NeurIPS.

[11] Arthur Jacot et al. Neural Tangent Kernel: Convergence and Generalization in Neural Networks, 2018, NeurIPS.

[12] Surya Ganguli et al. Statistical Mechanics of Deep Learning, 2020, Annual Review of Condensed Matter Physics.

[13] Florent Krzakala et al. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup, 2019, NeurIPS.

[14] Sanjeev Arora et al. Implicit Regularization in Deep Matrix Factorization, 2019, NeurIPS.

[15] Martha White et al. Meta-Learning Representations for Continual Learning, 2019, NeurIPS.

[16] Ruosong Wang et al. On Exact Computation with an Infinitely Wide Neural Net, 2019, NeurIPS.

[17] Naftali Tishby et al. Machine learning and the physical sciences, 2019, Reviews of Modern Physics.

[18] Jaehoon Lee et al. Wide neural networks of any depth evolve as linear models under gradient descent, 2019, NeurIPS.

[19] Jon Kleinberg et al. Transfusion: Understanding Transfer Learning for Medical Imaging, 2019, NeurIPS.

[20] Wei Hu et al. Width Provably Matters in Optimization for Deep Linear Neural Networks, 2019, ICML.

[21] Francis Bach et al. On Lazy Training in Differentiable Programming, 2018, NeurIPS.

[22] Surya Ganguli et al. A mathematical theory of semantic development in deep neural networks, 2018, Proceedings of the National Academy of Sciences.

[23] Christopher Summerfield et al. Comparing continual task learning in minds and machines, 2018, Proceedings of the National Academy of Sciences.

[24] Matus Telgarsky et al. Gradient descent aligns the layers of deep linear networks, 2018, ICLR.

[25] Wei Hu et al. A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks, 2018, ICLR.

[26] Surya Ganguli et al. An analytic theory of generalization dynamics and transfer learning in deep linear networks, 2018, ICLR.

[27] Justin A. Sirignano et al. Mean field analysis of neural networks: A central limit theorem, 2018, Stochastic Processes and their Applications.

[28] Thomas Laurent et al. Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global, 2017, ICML.

[29] Tomaso A. Poggio et al. Theory IIIb: Generalization in Deep Networks, 2018, arXiv.

[30] Jascha Sohl-Dickstein et al. Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks, 2018, ICML.

[31] Nathan Srebro et al. Implicit Bias of Gradient Descent on Linear Convolutional Networks, 2018, NeurIPS.

[32] Andrea Montanari et al. A mean field view of the landscape of two-layer neural networks, 2018, Proceedings of the National Academy of Sciences.

[33] Stefan Wermter et al. Continual Lifelong Learning with Neural Networks: A Review, 2018, Neural Networks.

[34] Sanjeev Arora et al. On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization, 2018, ICML.

[35] Surya Ganguli et al. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, 2017, NIPS.

[36] Surya Ganguli et al. Continual Learning Through Synaptic Intelligence, 2017, ICML.

[37] Andrei A. Rusu et al. Overcoming catastrophic forgetting in neural networks, 2016, Proceedings of the National Academy of Sciences.

[38] Jiri Matas et al. All you need is a good init, 2015, ICLR.

[39] Jian Sun et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015, IEEE International Conference on Computer Vision (ICCV).

[40] Surya Ganguli et al. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2013, ICLR.

[41] James L. McClelland. Incorporating rapid neocortical learning of new schema-consistent information into complementary learning systems theory, 2013, Journal of Experimental Psychology: General.

[42] Nart Bedin Atalay et al. Simulating probability learning and probabilistic reversal learning using the attention-gated reinforcement learning (AGREL) model, 2010, International Joint Conference on Neural Networks (IJCNN).

[43] Yoshua Bengio et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.

[44] Peter Stone et al. Transfer Learning for Reinforcement Learning Domains: A Survey, 2009, J. Mach. Learn. Res.

[45] Jason Weston et al. Curriculum learning, 2009, ICML.

[46] Nikolaus Kriegeskorte et al. Representational similarity analysis - connecting the branches of systems neuroscience, 2008, Frontiers in Systems Neuroscience.

[47] G. Murphy. The Big Book of Concepts, 2002.

[48] R. French. Catastrophic forgetting in connectionist networks, 1999, Trends in Cognitive Sciences.

[49] James L. McClelland et al. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory, 1995, Psychological Review.

[50] Saad et al. Exact solution for on-line learning in multilayer neural networks, 1995, Physical Review Letters.

[51] Michael Biehl et al. Learning by on-line gradient descent, 1995.

[52] John B. Moore et al. Global analysis of Oja's flow for neural networks, 1994, IEEE Trans. Neural Networks.

[53] R. Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions, 1990, Psychological Review.

[54] Guilherme França et al. Understanding the Dynamics of Gradient Flow in Overparameterized Linear Models, 2021, ICML.

[55] Grant M. Rotskoff et al. Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks, 2018, NeurIPS.

[56] C. A. Nelson et al. Learning to Learn, 2017, Encyclopedia of Machine Learning and Data Mining.

[57] Kenji Fukumizu. Effect of Batch Learning in Multilayer Neural Networks, 1998, ICONIP.

[58] Kurt Hornik et al. Neural networks and principal component analysis: Learning from examples without local minima, 1989, Neural Networks.

[59] Michael McCloskey et al. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem, 1989.

[60] S. Carey. Conceptual Change in Childhood, 1985.