If dropout limits trainable depth, does critical initialisation still matter? A large-scale statistical analysis on ReLU networks

Recent work in signal propagation theory has shown that dropout limits the depth to which information can propagate through a neural network. In this paper, we investigate the effect of initialisation on training speed and generalisation for ReLU networks within this depth limit. We ask the following research question: critical initialisation is crucial for training at large depth, but if dropout limits the depth at which networks are trainable, does initialising critically still matter? We conduct a large-scale controlled experiment and perform a statistical analysis of over $12\,000$ trained networks. We find that (1) trainable networks show no statistically significant difference in performance over a wide range of non-critical initialisations; (2) for initialisations that do show a statistically significant difference, the net effect on performance is small; and (3) only extreme initialisations (very small or very large) perform worse than criticality. These findings also apply to standard ReLU networks of moderate depth, as the special case of zero dropout. Our results therefore suggest that, in the shallow-to-moderate depth setting, critical initialisation provides no performance gain over off-critical initialisations, and that searching for off-critical initialisations that might improve training speed or generalisation is likely to be a fruitless endeavour.
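For context, the criticality condition studied here comes from mean field signal propagation theory: for a ReLU layer followed by inverted dropout with keep probability p, the variance map sits at its critical point when Var(W_ij) = 2p / fan_in, which reduces to standard He initialisation when p = 1. The minimal NumPy sketch below illustrates this condition only; the function name is ours, and it is an illustrative assumption based on the noisy-rectifier analysis referred to above, not the paper's exact experimental setup.

import numpy as np

def critical_relu_dropout_init(fan_in, fan_out, keep_prob=1.0, rng=None):
    # Illustrative helper (not from the paper): sample a weight matrix at the
    # assumed critical point for a ReLU layer with inverted dropout.
    # Var(W_ij) = 2 * keep_prob / fan_in; keep_prob = 1 recovers standard
    # He initialisation, Var(W_ij) = 2 / fan_in.
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(2.0 * keep_prob / fan_in)
    return rng.normal(loc=0.0, scale=sigma, size=(fan_out, fan_in))

# Example: a 784 -> 300 hidden layer trained with dropout keep probability 0.8.
W = critical_relu_dropout_init(784, 300, keep_prob=0.8)

An off-critical initialisation, in the sense investigated above, simply scales this variance away from the critical value by a constant factor.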
