Adaptively Sparse Transformers
Gonçalo M. Correia | Vlad Niculae | André F. T. Martins
[1] Myle Ott et al. Scaling Neural Machine Translation, 2018, WMT.
[2] Philipp Koehn et al. Findings of the 2014 Workshop on Statistical Machine Translation, 2014, WMT@ACL.
[3] Dipanjan Das et al. BERT Rediscovers the Classical NLP Pipeline, 2019, ACL.
[4] Rico Sennrich et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.
[5] Max Welling et al. Learning Sparse Neural Networks through L0 Regularization, 2017, ICLR.
[6] Vlad Niculae et al. A Regularized Framework for Sparse and Structured Neural Attention, 2017, NIPS.
[7] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[8] F. Clarke. Optimization and Nonsmooth Analysis, 1983.
[9] Salim Roukos et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[10] Rudolf Rosa et al. Extracting Syntactic Trees from Transformer Encoder Self-Attentions, 2018, BlackboxNLP@EMNLP.
[11] Philip Wolfe et al. Validation of subgradient optimization, 1974, Math. Program.
[12] Gholamreza Haffari et al. Selective Attention for Context-aware Neural Machine Translation, 2019, NAACL.
[13] André F. T. Martins et al. Sparse Sequence-to-Sequence Models, 2019, ACL.
[14] J. Zico Kolter et al. OptNet: Differentiable Optimization as a Layer in Neural Networks, 2017, ICML.
[15] Byron C. Wallace et al. Attention is not Explanation, 2019, NAACL.
[16] Yoshua Bengio et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.
[17] Lukasz Kaiser et al. Attention is All you Need, 2017, NIPS.
[18] Christopher D. Manning et al. Effective Approaches to Attention-based Neural Machine Translation, 2015, EMNLP.
[19] Karin M. Verspoor et al. Findings of the 2016 Conference on Machine Translation, 2016, WMT.
[20] Ruimao Zhang et al. SSN: Learning Sparse Switchable Normalization via SparsestMax, 2019, International Journal of Computer Vision.
[21] André F. T. Martins et al. Sparse and Constrained Attention for Neural Machine Translation, 2018, ACL.
[22] Ramón Fernández Astudillo et al. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification, 2016, ICML.
[23] Jörg Tiedemann et al. An Analysis of Encoder Representations in Transformer-Based Machine Translation, 2018, BlackboxNLP@EMNLP.
[24] Yann Dauphin et al. Pay Less Attention with Lightweight and Dynamic Convolutions, 2019, ICLR.
[25] Rico Sennrich et al. Context-Aware Neural Machine Translation Learns Anaphora Resolution, 2018, ACL.
[26] Edouard Grave et al. Adaptive Attention Span in Transformers, 2019, ACL.
[27] Alexander M. Rush et al. Latent Alignment and Variational Attention, 2018, NeurIPS.
[28] Yann Dauphin et al. Convolutional Sequence to Sequence Learning, 2017, ICML.
[29] C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics, 1988.
[30] Mauro Cettolo et al. Overview of the IWSLT 2017 Evaluation Campaign, 2017, IWSLT.
[31] Fedor Moiseev et al. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned, 2019, ACL.
[32] Jian Li et al. Multi-Head Attention with Disagreement Regularization, 2018, EMNLP.
[33] Ilya Sutskever et al. Language Models are Unsupervised Multitask Learners, 2019.
[34] André F. T. Martins et al. Learning Classifiers with Fenchel-Young Losses: Generalized Entropies, Margins, and Algorithms, 2018, AISTATS.
[35] Rico Sennrich et al. Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures, 2018, EMNLP.
[36] Ilya Sutskever et al. Generating Long Sequences with Sparse Transformers, 2019, arXiv.
[37] Philipp Koehn et al. Findings of the 2013 Workshop on Statistical Machine Translation, 2013, WMT@ACL.
[38] Marcin Junczys-Dowmunt et al. Marian: Cost-effective High-Quality Neural Machine Translation in C++, 2018, NMT@ACL.
[39] Anoop Cherian et al. On Differentiating Parameterized Argmin and Argmax Problems with Application to Bi-level Optimization, 2016, arXiv.