$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers
[1] Thorsten Brants, et al. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling, 2013, INTERSPEECH.
[2] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, ICCV.
[3] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.
[4] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[5] Mitesh M. Khapra, et al. On Controllable Sparse Alternatives to Softmax, 2018, NeurIPS.
[6] Guillaume Lample, et al. XNLI: Evaluating Cross-lingual Sentence Representations, 2018, EMNLP.
[7] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.
[8] Samy Bengio, et al. Tensor2Tensor for Neural Machine Translation, 2018, AMTA.
[9] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.
[10] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.
[11] Ilya Sutskever, et al. Generating Long Sequences with Sparse Transformers, 2019, arXiv.
[12] Zheng Zhang, et al. Star-Transformer, 2019, NAACL.
[13] Zheng Zhang, et al. BP-Transformer: Modelling Long-Range Context via Binary Partitioning, 2019, arXiv.
[14] André F. T. Martins, et al. Sparse Sequence-to-Sequence Models, 2019, ACL.
[15] Jesse Johnson, et al. Deep, Skinny Neural Networks are not Universal Approximators, 2018, ICLR.
[16] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[17] Omer Levy, et al. What Does BERT Look at? An Analysis of BERT’s Attention, 2019, BlackboxNLP@ACL.
[18] Yingming Li, et al. Fine-tune BERT with Sparse Self-Attention Mechanism, 2019, EMNLP.
[19] Martin Wattenberg, et al. Visualizing and Measuring the Geometry of BERT, 2019, NeurIPS.
[20] Xuancheng Ren, et al. Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection, 2019, arXiv.
[21] Pablo Barceló, et al. On the Turing Completeness of Modern Neural Network Architectures, 2019, ICLR.
[22] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[23] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.
[24] Edouard Grave, et al. Adaptive Attention Span in Transformers, 2019, ACL.
[25] André F. T. Martins, et al. Adaptively Sparse Transformers, 2019, EMNLP.
[26] Jiwei Li, et al. SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection, 2020, NeurIPS.
[27] M. Zaheer, et al. Big Bird: Transformers for Longer Sequences, 2020, NeurIPS.
[28] Lukasz Kaiser, et al. Reformer: The Efficient Transformer, 2020, ICLR.
[29] Omer Levy, et al. Blockwise Self-Attention for Long Document Understanding, 2019, Findings of EMNLP.
[30] Sashank J. Reddi, et al. Are Transformers Universal Approximators of Sequence-to-Sequence Functions?, 2019, ICLR.
[31] Arman Cohan, et al. Longformer: The Long-Document Transformer, 2020, arXiv.
[32] Yi Tay, et al. Efficient Transformers: A Survey, 2020, ACM Comput. Surv.
[33] Gino Brunner, et al. On Identifiability in Transformers, 2019, ICLR.
[34] Ankit Singh Rawat, et al. Low-Rank Bottleneck in Multi-head Attention Models, 2020, ICML.
[35] Michael Hahn, et al. Theoretical Limitations of Self-Attention in Neural Sequence Models, 2019, TACL.
[36] Aurko Roy, et al. Efficient Content-Based Sparse Attention with Routing Transformers, 2020, TACL.
[37] Jinwoo Shin, et al. Minimum Width for Universal Approximation, 2020, ICLR.