Multi-Head Attention: Collaborate Instead of Concatenate