ReZero is All You Need: Fast Convergence at Large Depth
Thomas Bachlechner | Bodhisattwa Prasad Majumder | Huanru Henry Mao | Garrison W. Cottrell | Julian McAuley