A Mixture of h − 1 Heads is Better than h Heads
Roy Schwartz | Dianqi Li | Hao Peng | Noah A. Smith