Sparse is Enough in Scaling Transformers
Aakanksha Chowdhery | Lukasz Kaiser | Afroz Mohiuddin | Henryk Michalewski | Wojciech Gajewski | Sebastian Jaszczur | Jonni Kanerva
[1] Mehdi Rezagholizadeh, et al. Fully Quantized Transformer for Machine Translation, 2020, EMNLP.
[2] Noam Shazeer, et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2021, ArXiv.
[3] Samy Bengio, et al. Discrete Autoencoders for Sequence Models, 2018, ArXiv.
[4] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.
[5] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[6] Dan Klein, et al. Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers, 2020, ArXiv.
[7] Zhijie Zhang, et al. Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch, 2021, ICLR.
[8] Ben Poole, et al. Categorical Reparameterization with Gumbel-Softmax, 2016, ICLR.
[9] Ilya Sutskever, et al. Generating Long Sequences with Sparse Transformers, 2019, ArXiv.
[10] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[11] Hany Hassan Awadalla, et al. FastFormers: Highly Efficient Transformer Models for Natural Language Understanding, 2020, SUSTAINLP.
[12] Krzysztof Maziarz, et al. Gumbel-Matrix Routing for Flexible Multi-task Learning, 2019, ArXiv.
[13] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[14] Forrest N. Iandola, et al. SqueezeBERT: What can computer vision teach NLP about efficient neural networks?, 2020, SUSTAINLP.
[15] Lukasz Kaiser, et al. Rethinking Attention with Performers, 2020, ArXiv.
[16] Song Han, et al. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing, 2020, ACL.
[17] Yao Zhao, et al. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization, 2020, ICML.
[18] Edouard Grave, et al. Adaptive Attention Span in Transformers, 2019, ACL.
[19] Chen Liang, et al. Carbon Emissions and Large Neural Network Training, 2021, ArXiv.
[20] Grigorios Tsoumakas, et al. A Divide-and-Conquer Approach to the Summarization of Long Documents, 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[21] H. Ney, et al. Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture, 2020, ACL.
[22] Yi Tay, et al. Efficient Transformers: A Survey, 2020, ArXiv.
[23] Swagath Venkataramani, et al. Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks, 2019, NeurIPS.
[24] Andrew McCallum, et al. Energy and Policy Considerations for Deep Learning in NLP, 2019, ACL.
[25] Yu Zhang, et al. Training RNNs as Fast as CNNs, 2017, EMNLP 2018.
[26] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[27] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.
[28] Manish Gupta, et al. Compression of Deep Learning Models for Text: A Survey, 2022, ACM Trans. Knowl. Discov. Data.
[29] Colin Raffel, et al. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer, 2021, NAACL.
[30] Michael W. Mahoney, et al. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT, 2019, AAAI.
[31] Thomas Wolf, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019, ArXiv.
[32] Jacob Devlin, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[33] Dustin Tran, et al. Mesh-TensorFlow: Deep Learning for Supercomputers, 2018, NeurIPS.
[34] Erich Elsen, et al. The State of Sparsity in Deep Neural Networks, 2019, ArXiv.
[35] Wenhu Chen, et al. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting, 2019, NeurIPS.
[36] Ji Li, et al. Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning, 2020, Findings of EMNLP.
[37] Li Yang, et al. Big Bird: Transformers for Longer Sequences, 2020, NeurIPS.
[38] Di He, et al. Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation, 2018, NeurIPS.
[39] Lukasz Kaiser, et al. Reformer: The Efficient Transformer, 2020, ICLR.
[40] Lukasz Kaiser, et al. Neural GPUs Learn Algorithms, 2015, ICLR.
[41] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.
[42] Aurko Roy, et al. Efficient Content-Based Sparse Attention with Routing Transformers, 2021, TACL.
[43] Yiming Yang, et al. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, 2020, ACL.
[44] Colin Raffel, et al. Do Transformer Modifications Transfer Across Implementations and Applications?, 2021, EMNLP.
[45] Franck Dernoncourt, et al. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents, 2018, NAACL.