Consider a loss function $L = \sum_{i=1}^{n} \ell_i^2$ with $\ell_i = f(x_i) - y_i$, where $f(x)$ is a deep feedforward network with $R$ layers, no bias terms and scalar output. Assume the network is overparametrized, that is, $d \gg n$, where $d$ is the number of parameters and $n$ is the number of data points. The networks are assumed to interpolate the training data (i.e., the minimum of $L$ is zero). If GD converges, it will converge to a critical point of $L$, namely a solution of $\sum_{i=1}^{n} \ell_i \nabla \ell_i = 0$. There are two kinds of critical points: those for which each term of the above sum vanishes individually, and those for which the expression vanishes only when all the terms are summed. The main claim in this note is that while GD can converge to both types of critical points, SGD can only converge to the first kind, which includes all global minima (see the sketch after the list below). We review other properties of the loss landscape:
• As shown rigorously by [1] for the case of smooth ReLUs, the global minima in the weights $W$, when not empty, are highly degenerate, with dimension $d - n$, and for them $\ell_i = 0$ for all $i = 1, \dots, n$ (see also [2]).
• Under additional assumptions, all of the global minima are connected within a unique and potentially very large global valley ([3], based on [4]).
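As a minimal sketch of the intuition behind the main claim (assuming single-sample SGD with a fixed step size $\eta > 0$; the step-size schedule is not specified here), consider the update
$$ w_{t+1} = w_t - \eta\, \ell_{i_t}(w_t)\, \nabla \ell_{i_t}(w_t), \qquad i_t \in \{1, \dots, n\} \text{ drawn at random}. $$
A point $w^*$ can be a fixed point of this update for every possible sample $i$ only if $\ell_i(w^*)\, \nabla \ell_i(w^*) = 0$ for each $i$ individually, that is, only at a critical point of the first kind. At a critical point of the second kind only the full sum $\sum_{i=1}^{n} \ell_i \nabla \ell_i$ vanishes, so some single-sample updates remain nonzero and SGD does not stay there.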
[1] Quynh Nguyen et al., "On Connected Sublevel Sets in Deep Learning," ICML, 2019.
[2] Tomaso A. Poggio et al., "Theory II: Landscape of the Empirical Risk in Deep Learning," arXiv, 2017.
[3] Zhanxing Zhu et al., "Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes," arXiv, 2017.
[4] Yaim Cooper et al., "The loss landscape of overparameterized neural networks," arXiv, 2018.
[5] Tomaso A. Poggio et al., "Fisher-Rao Metric, Geometry, and Complexity of Neural Networks," AISTATS, 2017.
[6] Zhanxing Zhu et al., "The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects," ICML, 2018.