On the Transformer Growth for Progressive BERT Training
Xiaotao Gu | Liyuan Liu | Hongkun Yu | Jing Li | Chen Chen | Jiawei Han