Improving Transformer Optimization Through Better Initialization
Xiao Shi Huang | Felipe Pérez | Jimmy Ba | Maksims Volkovs