Spurious Local Minima Are Common for Deep Neural Networks with Piecewise Linear Activations

In this paper, it is shown theoretically that spurious local minima are common for deep fully-connected networks and convolutional neural networks (CNNs) with piecewise linear activation functions and datasets that cannot be fitted by linear models. A motivating example explains why such spurious local minima exist: each output neuron of a deep fully-connected network or CNN with piecewise linear activations produces a continuous piecewise linear (CPWL) output, and different pieces of the CPWL output can fit disjoint groups of data samples when the empirical risk is minimized. Fitting the data with different CPWL functions usually results in different levels of empirical risk, which leads to the prevalence of spurious local minima. This result is proved in a general setting with an arbitrary continuous loss function. The main proof technique is to represent a CPWL function as a maximization over minimizations of linear pieces; deep ReLU networks are then constructed to produce these linear pieces and to implement the maximization and minimization operations.
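Since the argument hinges on the lattice (max-of-mins) representation of CPWL functions and on the fact that ReLU networks can implement pairwise max and min with linear layers, a minimal numerical sketch may help. It is not taken from the paper; the specific affine pieces, weights, and helper names (relu_max, relu_min, cpwl_direct, cpwl_via_relu) are illustrative assumptions, and the identities used are max(a, b) = a + ReLU(b - a) and min(a, b) = a - ReLU(a - b).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Pairwise max/min built from affine operations plus ReLU:
#   max(a, b) = a + relu(b - a)
#   min(a, b) = a - relu(a - b)
def relu_max(a, b):
    return a + relu(b - a)

def relu_min(a, b):
    return a - relu(a - b)

# A toy CPWL function on R in lattice (max-of-mins) form:
#   f(x) = max( min(l1(x), l2(x)), l3(x) )
# with illustrative affine pieces l_i(x) = w_i * x + b_i.
w = np.array([1.0, -1.0, 0.5])
b = np.array([0.0, 2.0, -1.0])

def cpwl_direct(x):
    """Evaluate f via explicit max/min over the affine pieces."""
    pieces = w[:, None] * x[None, :] + b[:, None]
    return np.maximum(np.minimum(pieces[0], pieces[1]), pieces[2])

def cpwl_via_relu(x):
    """Evaluate the same f using only affine maps and ReLU,
    mimicking how a ReLU network can realize the lattice form."""
    pieces = w[:, None] * x[None, :] + b[:, None]
    return relu_max(relu_min(pieces[0], pieces[1]), pieces[2])

x = np.linspace(-3.0, 3.0, 101)
assert np.allclose(cpwl_direct(x), cpwl_via_relu(x))  # both realizations agree
```

The sketch only checks that the two realizations coincide on a grid; in the paper's construction the linear pieces are produced by the network itself, and the max/min compositions are what allow different CPWL fits of disjoint data groups to sit at different empirical-risk levels.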
