Convergence Properties of Deep Neural Networks on Separable Data
Rémi Tachet des Combes | Mohammad Pezeshki | Samira Shabanian | Aaron C. Courville | Yoshua Bengio
[1] Morris Tenenbaum, et al. Ordinary differential equations: an elementary textbook for students of mathematics, engineering, and the sciences, 1963.
[2] Kurt Hornik, et al. Neural networks and principal component analysis: Learning from examples without local minima, 1989, Neural Networks.
[3] Hilbert J. Kappen, et al. On-line learning processes in artificial neural networks, 1993.
[4] Vladimir Vapnik, et al. Statistical learning theory, 1998.
[5] T. Poggio, et al. General conditions for predictivity in learning theory, 2004, Nature.
[6] Lorenzo Rosasco, et al. Are Loss Functions All the Same?, 2004, Neural Computation.
[7] Surya Ganguli, et al. Learning hierarchical categories in deep neural networks, 2013, CogSci.
[8] Yoshua Bengio, et al. How transferable are features in deep neural networks?, 2014, NIPS.
[9] Yoshua Bengio, et al. Generative Adversarial Nets, 2014, NIPS.
[10] Yann LeCun, et al. The Loss Surface of Multilayer Networks, 2014, ArXiv.
[11] Surya Ganguli, et al. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2013, ICLR.
[12] Ryota Tomioka, et al. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, 2014, ICLR.
[13] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[14] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[15] Jascha Sohl-Dickstein, et al. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Understanding and Improvement, 2017, ArXiv.
[16] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.
[17] Christopher Joseph Pal, et al. On orthogonality and learning recurrent networks with long term dependencies, 2017, ICML.
[18] Surya Ganguli, et al. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice, 2017, NIPS.
[19] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.
[20] Nathan Srebro, et al. The Implicit Bias of Gradient Descent on Separable Data, 2017, J. Mach. Learn. Res..
[21] Yuanzhi Li, et al. An Alternative View: When Does SGD Escape Local Minima?, 2018, ICML.
[22] Yi Zhou, et al. Convergence of SGD in Learning ReLU Models with Separable Data, 2018, ArXiv.
[23] Yuichi Yoshida, et al. Spectral Normalization for Generative Adversarial Networks, 2018, ICLR.
[24] Zhenyu Liao, et al. The Dynamics of Learning: A Random Matrix Approach, 2018, ICML.
[25] Colin Raffel, et al. Is Generator Conditioning Causally Related to GAN Performance?, 2018, ICML.
[26] Sanjeev Arora, et al. On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization, 2018, ICML.
[27] Nathan Srebro, et al. Convergence of Gradient Descent on Separable Data, 2018, AISTATS.
[28] Chico Q. Camargo, et al. Deep learning generalizes because the parameter-function map is biased towards simple functions, 2018, ICLR.
[29] Andrew M. Saxe, et al. High-dimensional dynamics of generalization error in neural networks, 2017, Neural Networks.