暂无分享,去创建一个
[1] Satrajit Chatterjee,et al. Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization , 2020, ICLR.
[2] Ken-ichi Kawarabayashi,et al. What Can Neural Networks Reason About? , 2019, ICLR.
[3] Jiawei Han,et al. Understanding the Difficulty of Training Transformers , 2020, EMNLP.
[4] Byron C. Wallace,et al. Attention is not Explanation , 2019, NAACL.
[5] Luca Maria Gambardella,et al. Max-pooling convolutional neural networks for vision-based hand gesture recognition , 2011, 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA).
[6] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.
[7] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[8] Colin Raffel,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..
[9] Ashish Vaswani,et al. Stand-Alone Self-Attention in Vision Models , 2019, NeurIPS.
[10] Kevin Gimpel,et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , 2019, ICLR.
[11] Ersin Yumer,et al. Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints , 2019, ICLR.
[12] Yang Liu,et al. On Identifiability in Transformers , 2020, ICLR.
[13] Raia Hadsell,et al. Neural Execution of Graph Algorithms , 2020, ICLR.
[14] Jure Leskovec,et al. How Powerful are Graph Neural Networks? , 2018, ICLR.
[15] Geoffrey E. Hinton,et al. Layer Normalization , 2016, ArXiv.
[16] Razvan Pascanu,et al. Stabilizing Transformers for Reinforcement Learning , 2019, ICML.
[17] Garrison W. Cottrell,et al. ReZero is All You Need: Fast Convergence at Large Depth , 2020, UAI.
[18] Yaron Lipman,et al. On Universal Equivariant Set Networks , 2020, ICLR.
[19] Ankit Singh Rawat,et al. Are Transformers universal approximators of sequence-to-sequence functions? , 2020, ICLR.
[20] Lukasz Kaiser,et al. Universal Transformers , 2018, ICLR.
[21] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[22] Zhijian Liu,et al. Lite Transformer with Long-Short Range Attention , 2020, ICLR.
[23] Edouard Grave,et al. Reducing Transformer Depth on Demand with Structured Dropout , 2019, ICLR.
[24] Alex Graves,et al. Generating Sequences With Recurrent Neural Networks , 2013, ArXiv.
[25] Omer Levy,et al. Improving Transformer Models by Reordering their Sublayers , 2020, ACL.
[26] Tim Salimans,et al. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks , 2016, NIPS.
[27] Andrea Vedaldi,et al. Instance Normalization: The Missing Ingredient for Fast Stylization , 2016, ArXiv.
[28] Yann Dauphin,et al. Pay Less Attention with Lightweight and Dynamic Convolutions , 2019, ICLR.
[29] Roger Wattenhofer,et al. Attentive Multi-Task Deep Reinforcement Learning , 2019, ECML/PKDD.
[30] Han Zhang,et al. Self-Attention Generative Adversarial Networks , 2018, ICML.
[31] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.
[32] Roger Wattenhofer,et al. Telling BERT's full story: from Local Attention to Global Aggregation , 2020, ArXiv.
[33] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[34] Kevin Gimpel,et al. Gaussian Error Linear Units (GELUs) , 2016 .
[35] Julian Salazar,et al. Transformers without Tears: Improving the Normalization of Self-Attention , 2019, ArXiv.
[36] Yiming Yang,et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding , 2019, NeurIPS.
[37] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.