On Symmetry and Initialization for Neural Networks

This work provides an additional step in the theoretical understanding of neural networks. We consider networks with one hidden layer and show that, when learning symmetric functions, one can choose the initial conditions so that standard SGD training efficiently yields a generalization guarantee. We verify this empirically and show that the guarantee fails when the initial conditions are chosen at random. The proof of convergence analyzes the interaction between the two layers of the network. Our results highlight the importance of using symmetry in the design of neural networks.
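
As a rough illustration of the setup (a sketch under my own assumptions, not the paper's actual construction or proof), the code below trains a one-hidden-layer ReLU network with plain SGD on a symmetric Boolean function: majority, which depends only on the number of ones in the input. The "symmetric" initialization gives every hidden unit identical incoming weights and evenly spaced bias thresholds, so each unit responds only to the Hamming weight of the input; the alternative draws all weights at random. The hinge loss, the hyperparameters, and all function names are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, hidden, steps, lr = 20, 64, 20000, 0.05   # input bits, hidden units, SGD steps, step size

def target(x):
    # A symmetric target: majority, which depends only on the number of ones.
    return 1.0 if x.sum() > n / 2 else -1.0

def init_symmetric():
    # Every hidden unit weights all coordinates equally, so its pre-activation
    # depends only on the Hamming weight; biases are evenly spaced thresholds.
    W = np.ones((hidden, n)) / n
    b = -np.linspace(0.0, 1.0, hidden)
    v = rng.normal(0.0, 1.0 / np.sqrt(hidden), hidden)
    return W, b, v

def init_random():
    # Generic random initialization, for comparison.
    W = rng.normal(0.0, 1.0 / np.sqrt(n), (hidden, n))
    b = np.zeros(hidden)
    v = rng.normal(0.0, 1.0 / np.sqrt(hidden), hidden)
    return W, b, v

def train(W, b, v):
    # Plain SGD on the hinge loss max(0, 1 - y * f(x)), training both layers.
    for _ in range(steps):
        x = rng.integers(0, 2, n).astype(float)
        y = target(x)
        h = np.maximum(W @ x + b, 0.0)           # hidden ReLU activations
        out = float(v @ h)
        if y * out < 1.0:                         # subgradient is nonzero only inside the margin
            mask = (h > 0).astype(float)
            grad_v = -y * h
            grad_W = -y * np.outer(v * mask, x)
            grad_b = -y * v * mask
            v -= lr * grad_v
            W -= lr * grad_W
            b -= lr * grad_b
    return W, b, v

def test_error(W, b, v, trials=2000):
    # Error on fresh uniform inputs.
    errs = 0
    for _ in range(trials):
        x = rng.integers(0, 2, n).astype(float)
        pred = np.sign(float(v @ np.maximum(W @ x + b, 0.0)))
        errs += pred != target(x)
    return errs / trials

for name, init in [("symmetric init", init_symmetric), ("random init", init_random)]:
    W, b, v = train(*init())
    print(f"{name}: test error {test_error(W, b, v):.3f}")
```

Comparing the printed test errors of the two initializations mimics the kind of experiment described in the abstract, though the paper's precise initialization scheme and its guarantees differ from this sketch.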
