The Multilinear Structure of ReLU Networks

We study the loss surface of neural networks equipped with a hinge loss criterion and ReLU or leaky ReLU nonlinearities. Any such network defines a piecewise multilinear form in parameter space. By appealing to harmonic analysis, we show that all local minima of such a network are non-differentiable, except for those minima that occur in a region of parameter space where the loss surface is perfectly flat. Non-differentiable minima are therefore not technicalities or pathologies; they are at the heart of the problem when investigating the loss surface of ReLU networks. As a consequence, we must employ techniques from nonsmooth analysis to study these loss surfaces. We show how to apply these techniques in some illustrative cases.

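As a concrete illustration of the piecewise multilinear structure described above, the following minimal sketch (our own, not code from the paper; the dimensions, data, and helper name are placeholders) evaluates a two-layer ReLU network with hinge loss on a single sample. Holding the first-layer weights fixed also fixes the ReLU activation pattern, so along any line in the second-layer parameters the loss is piecewise affine: its second differences vanish wherever the hinge region does not change, which is one face of the piecewise multilinear surface.

```python
# Minimal numerical sketch (not from the paper): a two-layer ReLU network
# f(x; W, v) = v^T relu(W x) with hinge loss max(0, 1 - y f).
# The activation pattern depends only on W and x, so varying v keeps the
# pattern fixed; on the region where the hinge is active, the loss is then
# affine in v.

import numpy as np

rng = np.random.default_rng(0)

d, h = 5, 7                      # input dimension and hidden width (arbitrary)
x = rng.normal(size=d)           # a single data point
y = 1.0                          # its +/-1 label
W = rng.normal(size=(h, d))      # first-layer weights (held fixed)
v0 = rng.normal(size=h)          # base second-layer weights
u = rng.normal(size=h)           # a direction in second-layer parameter space

def hinge_loss(v):
    f = v @ np.maximum(W @ x, 0.0)   # network output; activation pattern fixed by W
    return max(0.0, 1.0 - y * f)     # hinge loss on this sample

# Sample the loss along the line v0 + t*u.  Second differences are ~0 on
# stretches of t where the hinge stays in the same regime (active or flat),
# confirming the loss restricted to one such region is affine in v.
ts = np.linspace(-0.5, 0.5, 11)
losses = np.array([hinge_loss(v0 + t * u) for t in ts])
print(losses)
print(np.diff(losses, n=2))
```

The same check applied to the first-layer weights (with the second layer fixed and the step small enough that no hidden unit switches sign) exhibits the other factor of the multilinear structure: the loss is separately affine in each layer's parameters on regions of constant activation pattern, and the non-differentiable points arise exactly where patterns switch.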