Deep Neural Networks with Multi-Branch Architectures Are Intrinsically Less Non-Convex

Several recently proposed neural network architectures, such as ResNeXt, Inception, Xception, SqueezeNet, and Wide ResNet, share the design principle of using multiple branches and have demonstrated improved performance in many applications. We show that one cause of this success is that multi-branch architectures are less non-convex in terms of the duality gap. The duality gap measures the degree of intrinsic non-convexity of an optimization problem: a smaller relative gap implies a lower degree of intrinsic non-convexity. The challenge is to quantitatively measure the duality gap of highly non-convex problems such as deep neural networks. In this work, we provide strong guarantees on this quantity for two classes of network architectures. For neural networks with arbitrary activation functions, a multi-branch architecture, and a variant of the hinge loss, we show that the duality gap of both the population and empirical risks shrinks to zero as the number of branches increases. This result sheds light on the power of over-parametrization, where increasing the number of branches tends to make the loss surface less non-convex. For neural networks with linear activations and the $\ell_2$ loss, we show that the duality gap of the empirical risk is zero. Both results hold for arbitrary depths, and the analytical techniques may be of independent interest to non-convex optimization more broadly. Experiments on both synthetic and real-world datasets validate our results.
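As a point of reference, the duality gap discussed above is the standard Lagrangian duality gap; the following is a minimal sketch of that definition, where the objective $f$, constraint map $g$, Lagrangian $L$, and the symbols $p^*$, $d^*$, $\Delta$ are illustrative notation rather than the paper's own:

\begin{align}
  p^* &= \min_{w}\; \max_{\lambda \ge 0}\; L(w,\lambda), \\ % primal optimum of min_w f(w) subject to g(w) <= 0
  d^* &= \max_{\lambda \ge 0}\; \min_{w}\; L(w,\lambda), \\ % dual optimum
  \Delta &= p^* - d^* \;\ge\; 0, % duality gap; Delta = 0 means strong duality
\end{align}

where $L(w,\lambda) = f(w) + \lambda^\top g(w)$ is the Lagrangian of the training problem over network parameters $w$. Under this reading, the first result says that the (relative) gap $\Delta$ of multi-branch networks vanishes as the number of branches grows, and the second says $\Delta = 0$ exactly for deep linear networks trained with the $\ell_2$ loss.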
