A Note on Linear Bottleneck Networks and Their Transition to Multilinearity

Randomly initialized wide neural networks transition to linear functions of the weights as the width grows, in a ball of radius O(1) around initialization. A necessary condition for this result is that all layers of the network are wide enough, i.e., all widths tend to infinity. However, the transition to linearity breaks down when this infinite-width assumption is violated. In this work we show that linear networks with a bottleneck layer are bilinear functions of the weights in a ball of radius O(1) around initialization. More generally, with B − 1 bottleneck layers, the network is a degree-B multilinear function of the weights. Importantly, the degree depends only on the number of bottlenecks and not on the total depth of the network. Note that a deep linear network is always linear in the weights of each individual layer; however, as a function of all the weights together, it is a polynomial of degree equal to the total number of layers. Our analysis of bottleneck networks shows that, as the width of the non-bottleneck layers grows, the degree of this polynomial reduces to the number of bottleneck layers plus one and becomes independent of the total number of layers. For wide networks without bottlenecks, this recovers the transition to linearity of wide neural networks. In our technical analysis, for a bottleneck neural network (BNN) with B − 1 bottleneck layers, we show that the spectral norm of the (B + 1)-st derivative of the network function with respect to the parameters scales as 1/√m, where m is the width of the non-bottleneck layers, whereas the spectral norm of the B-th derivative is Ω(1). As a result, as m goes to infinity, the network function transitions to a degree-B polynomial of the weights. We further strengthen this claim by showing that this polynomial is in fact a multilinear function: the network function is jointly linear in the layer weights between consecutive bottleneck layers.
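
The claim can be probed numerically. The sketch below is a minimal illustration, not the paper's construction; the architecture d_in -> m -> m -> k (bottleneck) -> m -> m -> 1, the widths, and the gradient-aligned perturbation directions are all assumptions chosen for the demo. It uses NTK-style 1/sqrt(fan_in) scaling and evaluates the mixed second difference f(θ+u+v) − f(θ+u) − f(θ+v) + f(θ) for unit-Frobenius-norm perturbations u, v, each supported on a single layer. This quantity vanishes exactly when f is jointly linear on the span of u and v. If the claim above holds, it should shrink as m grows when both perturbed layers lie on the same side of the bottleneck, but remain bounded away from zero (roughly of order 1/k, independent of m) when they straddle it.

import numpy as np

rng = np.random.default_rng(0)

def init(d_in, m, k, rng):
    # Deep linear network d_in -> m -> m -> k (bottleneck) -> m -> m -> 1.
    dims = [d_in, m, m, k, m, m, 1]
    return [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]

def forward(Ws, x):
    h = x
    for W in Ws:
        h = W @ h / np.sqrt(W.shape[1])  # NTK-style 1/sqrt(fan_in) scaling
    return h.item()

def grad_direction(Ws, x, layer):
    # Unit-Frobenius-norm copy of df/dW_layer, embedded in an all-zero parameter list.
    h = [x]
    for W in Ws:                         # forward activations h[0], ..., h[L]
        h.append(W @ h[-1] / np.sqrt(W.shape[1]))
    g = np.ones(1)                       # backward vector df/dh, starting at the output
    for W in Ws[:layer:-1]:              # layers L-1, ..., layer+1 in reverse
        g = W.T @ g / np.sqrt(W.shape[1])
    G = np.outer(g, h[layer]) / np.sqrt(Ws[layer].shape[1])
    dirs = [np.zeros_like(W) for W in Ws]
    dirs[layer] = G / np.linalg.norm(G)
    return dirs

def mixed_diff(Ws, u, v, x):
    # f(th+u+v) - f(th+u) - f(th+v) + f(th): zero iff f is jointly linear in span{u, v}.
    add = lambda A, B: [a + b for a, b in zip(A, B)]
    f = lambda P: forward(P, x)
    return f(add(add(Ws, u), v)) - f(add(Ws, u)) - f(add(Ws, v)) + f(Ws)

d_in, k = 10, 4
x = rng.standard_normal(d_in)
for m in [100, 400, 1600]:
    Ws = init(d_in, m, k, rng)
    u = grad_direction(Ws, x, 1)         # W_2: before the bottleneck
    v_same = grad_direction(Ws, x, 0)    # W_1: same block as W_2
    v_cross = grad_direction(Ws, x, 4)   # W_5: on the other side of the bottleneck
    print(f"m={m:5d}  same-block |Delta| = {abs(mixed_diff(Ws, u, v_same, x)):.2e}"
          f"   across-bottleneck |Delta| = {abs(mixed_diff(Ws, u, v_cross, x)):.2e}")

The gradient-aligned directions are a design choice for the demo rather than anything prescribed by the note: generic Frobenius-normalized random directions have vanishing operator norm in high dimensions, so both quantities would be trivially small and the comparison uninformative. Exact values fluctuate with the random seed, but the same-block difference should decay with m while the across-bottleneck difference should not.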
