EFFECT OF ACTIVATION FUNCTIONS ON THE TRAINING OF OVERPARAMETRIZED NEURAL NETS

It is well-known that overparametrized neural networks trained using gradient-based methods quickly achieve small training error with appropriate hyperparameter settings. Recent papers have proved this statement theoretically for highly overparametrized networks under reasonable assumptions. These results either assume that the activation function is ReLU or they depend on the minimum eigenvalue of a certain Gram matrix. In the latter case, existing works only show that this minimum eigenvalue is non-zero and do not provide quantitative lower bounds, even though fast training requires this eigenvalue to be large. Empirically, a number of alternative activation functions have been proposed that tend to perform better than ReLU in at least some settings, but no clear understanding of why has emerged. This state of affairs underscores the importance of theoretically understanding the impact of activation functions on training. In the present paper, we provide theoretical results about the effect of the activation function on the training of highly overparametrized 2-layer neural networks. A crucial property that governs the performance of an activation is whether or not it is smooth:

• For non-smooth activations such as ReLU, SELU, and ELU, whose first- or second-order derivative is discontinuous at some point, all eigenvalues of the associated Gram matrix are large under minimal assumptions on the data.

• For smooth activations such as tanh, swish, and polynomials, which have derivatives of all orders at all points, the situation is more complex: if the subspace spanned by the data has small dimension, then the minimum eigenvalue of the Gram matrix can be small, leading to slow training. But if the dimension is large and the data satisfies another mild condition, then the eigenvalues are large. If we allow deep networks, then small data dimension is not a limitation provided that the depth is sufficient.

We discuss a number of extensions and applications of these results.
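As a concrete illustration (not taken from the paper), the following NumPy sketch estimates the minimum eigenvalue of a Gram matrix of the form H_ij = E_{w ~ N(0, I)}[σ'(⟨w, x_i⟩) σ'(⟨w, x_j⟩)] ⟨x_i, x_j⟩, the kind of quantity referred to above in NTK-style convergence analyses, for ReLU versus tanh on data that either spans the ambient space or lies in a low-dimensional subspace. The specific matrix form, function names, and parameters here are assumptions chosen for illustration; the paper's precise definitions and bounds may differ.

```python
# Illustrative sketch only; the exact Gram matrix and assumptions in the paper may differ.
import numpy as np

def gram_min_eig(X, act_deriv, n_mc=20000, seed=0):
    """Monte-Carlo estimate of the minimum eigenvalue of
    H_ij = E_{w ~ N(0, I)}[sigma'(<w, x_i>) sigma'(<w, x_j>)] * <x_i, x_j>."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_mc, X.shape[1]))   # random first-layer weights
    D = act_deriv(W @ X.T)                        # (n_mc, n): sigma'(<w, x_i>)
    H = (D.T @ D / n_mc) * (X @ X.T)              # entrywise product with <x_i, x_j>
    return np.linalg.eigvalsh(H)[0]               # eigvalsh returns ascending eigenvalues

def unit_rows(A):
    return A / np.linalg.norm(A, axis=1, keepdims=True)

rng = np.random.default_rng(1)
n, d, k = 50, 100, 3
X_full = unit_rows(rng.standard_normal((n, d)))                                 # data spanning R^d
X_low = unit_rows(rng.standard_normal((n, k)) @ rng.standard_normal((k, d)))    # data in a k-dim subspace

relu_deriv = lambda Z: (Z > 0).astype(float)
tanh_deriv = lambda Z: 1.0 - np.tanh(Z) ** 2

for name, X in [("full-dimensional", X_full), ("low-dimensional", X_low)]:
    print(name,
          " ReLU:", round(gram_min_eig(X, relu_deriv), 4),
          " tanh:", round(gram_min_eig(X, tanh_deriv), 4))
```

Under the dichotomy described above, one would expect the tanh estimate to shrink noticeably when the data is confined to a low-dimensional subspace, while the ReLU estimate remains comparatively large; this sketch is only meant to make that quantity concrete, not to reproduce the paper's bounds.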
