Stable ResNet

Deep ResNet architectures have achieved state-of-the-art performance on many tasks. While they solve the vanishing gradient problem, they may suffer from exploding gradients as the depth becomes large (Yang et al. 2017). Moreover, recent results have shown that ResNets may lose expressivity as the depth goes to infinity (Yang et al. 2017, Hayou et al. 2019). To resolve these issues, we introduce a new class of ResNet architectures, called Stable ResNet, which stabilizes the gradients while ensuring expressivity in the infinite-depth limit.
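To make the idea concrete, the sketch below shows one way such an architecture can be written in PyTorch. The abstract does not specify how the gradients are stabilized, so this is a minimal illustration under the assumption that each residual branch is scaled by a depth-dependent factor lambda_l, here the uniform choice lambda_l = 1/sqrt(L) for a network with L residual blocks; the names StableResidualBlock and StableResNet, the block design, and the hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a depth-scaled ("stable") residual network.
# Assumption: each residual branch is scaled by lam = 1/sqrt(L), where L is
# the total number of residual blocks; this is one common way to keep
# gradient norms bounded as L grows.
import math

import torch
import torch.nn as nn


class StableResidualBlock(nn.Module):
    """A residual block whose branch output is scaled by a fixed factor."""

    def __init__(self, width: int, scale: float):
        super().__init__()
        self.scale = scale
        self.branch = nn.Sequential(
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, width),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x_{l+1} = x_l + lambda_l * F(x_l)
        return x + self.scale * self.branch(x)


class StableResNet(nn.Module):
    """Stack of L scaled residual blocks with a linear embedding and readout."""

    def __init__(self, in_dim: int, width: int, out_dim: int, depth: int):
        super().__init__()
        scale = 1.0 / math.sqrt(depth)  # uniform scaling lambda_l = 1/sqrt(L)
        self.embed = nn.Linear(in_dim, width)
        self.blocks = nn.ModuleList(
            [StableResidualBlock(width, scale) for _ in range(depth)]
        )
        self.readout = nn.Linear(width, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.embed(x)
        for block in self.blocks:
            h = block(h)
        return self.readout(h)


if __name__ == "__main__":
    # Sanity check: gradients remain finite even for a very deep network.
    model = StableResNet(in_dim=32, width=128, out_dim=10, depth=200)
    x = torch.randn(8, 32)
    loss = model(x).pow(2).mean()
    loss.backward()
    print(sum(p.grad.norm() for p in model.parameters() if p.grad is not None))
```

Setting scale = 1.0 in this sketch recovers a standard unscaled ResNet, which is the regime where, as the abstract notes, gradients can explode at large depth.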

References

[1] Surya Ganguli et al. Exponential expressivity in deep neural networks through transient chaos, 2016, NIPS.

[2] Jaehoon Lee et al. Deep Neural Networks as Gaussian Processes, 2017, ICLR.

[3] Jaehoon Lee et al. Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes, 2018, ICLR.

[4] Ruosong Wang et al. On Exact Computation with an Infinitely Wide Neural Net, 2019, NeurIPS.

[5] Nikos Komodakis et al. Wide Residual Networks, 2016, BMVC.

[6] Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[7] Greg Yang et al. Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes, 2019, NeurIPS.

[8] Yoram Singer et al. Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity, 2016, NIPS.

[9] Greg Yang et al. Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation, 2019, ArXiv.

[10] Arnaud Doucet et al. On the Impact of the Activation Function on Deep Neural Networks Training, 2019, ICML.

[11] Arthur Jacot et al. Neural Tangent Kernel: Convergence and Generalization in Neural Networks, 2018, NeurIPS.

[12] Surya Ganguli et al. Deep Information Propagation, 2016, ICLR.

[13] Jonathan Ragan-Kelley et al. Neural Kernels Without Tangents, 2020, ICML.

[14] Jaehoon Lee et al. Wide neural networks of any depth evolve as linear models under gradient descent, 2019, NeurIPS.

[15] Francis Bach et al. On Lazy Training in Differentiable Programming, 2018, NeurIPS.

[16] Jian Sun et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015, ICCV.

[17] K. Brown et al. Graduate Texts in Mathematics, 1982.

[18] Jian Sun et al. Deep Residual Learning for Image Recognition, 2016, CVPR.

[19] V. Paulsen et al. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces, 2016.

[20] R. M. Dudley. Real Analysis and Probability, 2002.

[21] Roger B. Grosse et al. Picking Winning Tickets Before Training by Preserving Gradient Flow, 2020, ICLR.

[22] Ines Fischer. Multivariate Polysplines: Applications to Numerical and Wavelet Analysis, 2016.

[23] Jaehoon Lee et al. Finite Versus Infinite Neural Networks: an Empirical Study, 2020, NeurIPS.

[24] Matthias W. Seeger et al. PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification, 2003, J. Mach. Learn. Res.

[25] Yee Whye Teh et al. Bayesian Deep Ensembles via the Neural Tangent Kernel, 2020, NeurIPS.

[26] Radford M. Neal. Bayesian Learning for Neural Networks, 1995.

[27] Francis Bach et al. A Note on Lazy Training in Supervised Differentiable Programming, 2018, ArXiv.

[28] Richard E. Turner et al. Gaussian Process Behaviour in Wide Deep Neural Networks, 2018, ICLR.

[29] N. Aronszajn. Theory of Reproducing Kernels, 1950.

[30] Yee Whye Teh et al. Pruning untrained neural networks: Principles and Analysis, 2020, ArXiv.

[31] T. MacRobert. Spherical harmonics: an elementary treatise on harmonic functions, 1927.

[32] Tengyu Ma et al. Fixup Initialization: Residual Learning Without Normalization, 2019, ICLR.

[33] Yuesheng Xu et al. Universal Kernels, 2006, J. Mach. Learn. Res.

[34] Greg Yang et al. A Fine-Grained Spectral Perspective on Neural Networks, 2019, ArXiv.

[35] Ingo Steinwart et al. Convergence Types and Rates in Generic Karhunen-Loève Expansions with Applications to Sample Path Properties, 2014, Potential Analysis.

[36] Natalia Gimelshein et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[37] U. Grenander. Stochastic processes and statistical inference, 1950.

[38] A. Mukherjea et al. Real and Functional Analysis, 1978.

[39] Greg Yang. Tensor Programs III: Neural Matrix Laws, 2020, ArXiv.

[40] Jaehoon Lee et al. Neural Tangents: Fast and Easy Infinite Neural Networks in Python, 2019, ICLR.

[41] Kenji Fukumizu et al. Universality, Characteristic Kernels and RKHS Embedding of Measures, 2010, J. Mach. Learn. Res.

[42] Dino Sejdinovic et al. Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences, 2018, ArXiv.

[43] Samuel S. Schoenholz et al. Mean Field Residual Networks: On the Edge of Chaos, 2017, NIPS.

[44] Samuel L. Smith et al. Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks, 2020, NeurIPS.

[45] Arnaud Doucet et al. Exact Convergence Rates of the Neural Tangent Kernel in the Large Depth Limit, 2019, arXiv:1905.13654.