Deep Neural Networks with Multi-Branch Architectures Are Less Non-Convex

Several recently proposed neural network architectures, such as ResNeXt, Inception, Xception, SqueezeNet and Wide ResNet, are based on the design principle of having multiple branches and have demonstrated improved performance in many applications. We show that one cause of this success is that the multi-branch architecture is less non-convex in terms of the duality gap. The duality gap measures the degree of intrinsic non-convexity of an optimization problem: a smaller relative gap implies a lower degree of intrinsic non-convexity. The challenge is to quantitatively measure the duality gap of highly non-convex problems such as deep neural networks. In this work, we provide strong guarantees on this quantity for two classes of network architectures. For neural networks with arbitrary activation functions, a multi-branch architecture, and a variant of the hinge loss, we show that the duality gap of both the population and empirical risks shrinks to zero as the number of branches increases. This result sheds light on the power of over-parametrization, where increasing the network width tends to make the loss surface less non-convex. For neural networks with a linear activation function and $\ell_2$ loss, we show that the duality gap of the empirical risk is zero. Both results hold for arbitrary depths and adversarial data, and the analytical techniques may be of independent interest to non-convex optimization more broadly. Experiments on both synthetic and real-world datasets validate our results.
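To recall the central quantity, the sketch below gives the textbook definition of a duality gap for a generic constrained problem; the notation ($f$, $g$, $w$, $\lambda$, $p^\star$, $d^\star$) is illustrative and not taken from the paper, whose primal and dual formulations are tailored to the network risks it studies.

\begin{align*}
p^\star &= \min_{w} \; f(w) \quad \text{subject to } g(w) \le 0, \\
d^\star &= \max_{\lambda \ge 0} \; \min_{w} \; \big[ f(w) + \lambda^{\top} g(w) \big].
\end{align*}

Weak duality always gives $d^\star \le p^\star$, so the duality gap $\Delta = p^\star - d^\star$ is nonnegative; "less non-convex" refers to the relative size of $\Delta$ (e.g., $\Delta$ normalized by the scale of the objective) being small, and $\Delta = 0$ corresponds to strong duality, as holds for convex problems under standard constraint qualifications.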
