A Note on Linear Bottleneck Networks and Their Transition to Multilinearity

Randomly initialized wide neural networks transition to linear functions of the weights as the width grows, in a ball of radius O(1) around initialization. A necessary condition for this result is that all layers of the network are wide enough, i.e., all widths tend to infinity. However, the transition to linearity breaks down when this infinite-width assumption is violated. In this work we show that linear networks with a bottleneck layer are bilinear functions of the weights in a ball of radius O(1) around initialization. More generally, with B − 1 bottleneck layers, the network is a degree-B multilinear function of the weights. Importantly, the degree depends only on the number of bottlenecks and not on the total depth of the network. Note that a deep linear network is always linear in the weights of each individual layer; however, as a function of all the weights together, it is a polynomial of degree equal to the total number of layers. Our analysis of bottleneck networks shows that, as the width of the non-bottleneck layers grows, the degree of this polynomial reduces to the number of bottleneck layers plus one and becomes independent of the total number of layers. For wide networks without bottlenecks, this recovers the transition to linearity of wide neural networks. In our technical analysis, for a bottleneck neural network (BNN) with B − 1 bottleneck layers, we show that the spectral norm of the (B + 1)-st derivative of the network function with respect to the parameters scales as 1/√m, where m is the width of the non-bottleneck layers, whereas the spectral norm of the B-th derivative is Ω(1). As a result, as m goes to infinity, the network function transitions to a degree-B polynomial of the weights. We further strengthen this claim by showing that this polynomial is in fact a multilinear function: the network function is jointly linear in the layer weights between consecutive bottleneck layers.
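
The claim can be probed numerically. The sketch below is a minimal illustration, not the paper's construction; the architecture d_in -> m -> m -> k (bottleneck) -> m -> m -> 1, the widths, and the gradient-aligned perturbation directions are all assumptions chosen for the demo. It uses NTK-style 1/sqrt(fan_in) scaling and evaluates the mixed second difference f(θ+u+v) − f(θ+u) − f(θ+v) + f(θ) for unit-Frobenius-norm perturbations u, v, each supported on a single layer. This quantity vanishes exactly when f is jointly linear on the span of u and v. If the claim above holds, it should shrink as m grows when both perturbed layers lie on the same side of the bottleneck, but remain bounded away from zero (roughly of order 1/k, independent of m) when they straddle it.

import numpy as np

rng = np.random.default_rng(0)

def init(d_in, m, k, rng):
    # Deep linear network d_in -> m -> m -> k (bottleneck) -> m -> m -> 1.
    dims = [d_in, m, m, k, m, m, 1]
    return [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]

def forward(Ws, x):
    h = x
    for W in Ws:
        h = W @ h / np.sqrt(W.shape[1])  # NTK-style 1/sqrt(fan_in) scaling
    return h.item()

def grad_direction(Ws, x, layer):
    # Unit-Frobenius-norm copy of df/dW_layer, embedded in an all-zero parameter list.
    h = [x]
    for W in Ws:                         # forward activations h[0], ..., h[L]
        h.append(W @ h[-1] / np.sqrt(W.shape[1]))
    g = np.ones(1)                       # backward vector df/dh, starting at the output
    for W in Ws[:layer:-1]:              # layers L-1, ..., layer+1 in reverse
        g = W.T @ g / np.sqrt(W.shape[1])
    G = np.outer(g, h[layer]) / np.sqrt(Ws[layer].shape[1])
    dirs = [np.zeros_like(W) for W in Ws]
    dirs[layer] = G / np.linalg.norm(G)
    return dirs

def mixed_diff(Ws, u, v, x):
    # f(th+u+v) - f(th+u) - f(th+v) + f(th): zero iff f is jointly linear in span{u, v}.
    add = lambda A, B: [a + b for a, b in zip(A, B)]
    f = lambda P: forward(P, x)
    return f(add(add(Ws, u), v)) - f(add(Ws, u)) - f(add(Ws, v)) + f(Ws)

d_in, k = 10, 4
x = rng.standard_normal(d_in)
for m in [100, 400, 1600]:
    Ws = init(d_in, m, k, rng)
    u = grad_direction(Ws, x, 1)         # W_2: before the bottleneck
    v_same = grad_direction(Ws, x, 0)    # W_1: same block as W_2
    v_cross = grad_direction(Ws, x, 4)   # W_5: on the other side of the bottleneck
    print(f"m={m:5d}  same-block |Delta| = {abs(mixed_diff(Ws, u, v_same, x)):.2e}"
          f"   across-bottleneck |Delta| = {abs(mixed_diff(Ws, u, v_cross, x)):.2e}")

The gradient-aligned directions are a design choice for the demo rather than anything prescribed by the note: generic Frobenius-normalized random directions have vanishing operator norm in high dimensions, so both quantities would be trivially small and the comparison uninformative. Exact values fluctuate with the random seed, but the same-block difference should decay with m while the across-bottleneck difference should not.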
