Understanding the Difficulty of Training Transformers