Are Sixteen Heads Really Better than One?
[1] Yann LeCun, et al. Optimal Brain Damage, 1989, NIPS.
[2] Babak Hassibi, et al. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, 1992, NIPS.
[3] Philipp Koehn, et al. Statistical Significance Tests for Machine Translation Evaluation, 2004, EMNLP.
[4] Chris Brockett, et al. Automatically Constructing a Corpus of Sentential Paraphrases, 2005, IJCNLP.
[5] Chris Callison-Burch, et al. Open Source Toolkit for Statistical Machine Translation: Factored Translation Models and Lattice Decoding, 2006.
[6] Philipp Koehn, et al. Moses: Open Source Toolkit for Statistical Machine Translation, 2007, ACL.
[7] Christopher Potts, et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, 2013, EMNLP.
[8] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.
[9] Marcello Federico, et al. Report on the 11th IWSLT Evaluation Campaign, 2014, IWSLT.
[10] Song Han, et al. Learning both Weights and Connections for Efficient Neural Network, 2015, NIPS.
[11] David Chiang, et al. Auto-Sizing Neural Networks: With Applications to n-gram Language Models, 2015, EMNLP.
[12] Christopher D. Manning, et al. Effective Approaches to Attention-based Neural Machine Translation, 2015, EMNLP.
[13] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.
[14] Mirella Lapata, et al. Long Short-Term Memory-Networks for Machine Reading, 2016, EMNLP.
[15] Christopher D. Manning, et al. Compression of Neural Machine Translation Models via Pruning, 2016, CoNLL.
[16] Jakob Uszkoreit, et al. A Decomposable Attention Model for Natural Language Inference, 2016, EMNLP.
[17] Alexander M. Rush, et al. Sequence-Level Knowledge Distillation, 2016, EMNLP.
[18] Alexander Binder, et al. Layer-Wise Relevance Propagation for Neural Networks with Local Renormalization Layers, 2016, ICANN.
[19] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[20] Naftali Tishby, et al. Opening the Black Box of Deep Neural Networks via Information, 2017, ArXiv.
[21] Richard Socher, et al. Weighted Transformer Network for Machine Translation, 2017, ArXiv.
[22] Timo Aila, et al. Pruning Convolutional Neural Networks for Resource Efficient Inference, 2016, ICLR.
[23] Wonyong Sung, et al. Structured Pruning of Deep Convolutional Neural Networks, 2015, ACM J. Emerg. Technol. Comput. Syst.
[24] Hanan Samet, et al. Pruning Filters for Efficient ConvNets, 2016, ICLR.
[25] Richard Socher, et al. A Deep Reinforced Model for Abstractive Summarization, 2017, ICLR.
[26] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.
[27] Jörg Tiedemann, et al. An Analysis of Encoder Representations in Transformer-Based Machine Translation, 2018, BlackboxNLP@EMNLP.
[28] Tao Shen, et al. DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding, 2017, AAAI.
[29] Myle Ott, et al. Scaling Neural Machine Translation, 2018, WMT.
[30] Graham Neubig, et al. MTNT: A Testbed for Machine Translation of Noisy Text, 2018, EMNLP.
[31] Rico Sennrich, et al. Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures, 2018, EMNLP.
[32] Fedor Moiseev, et al. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned, 2019, ACL.
[33] Graham Neubig, et al. compare-mt: A Tool for Holistic Comparison of Language Generation Systems, 2019, NAACL.
[34] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[35] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.
[36] Samuel R. Bowman, et al. Neural Network Acceptability Judgments, 2018, Transactions of the Association for Computational Linguistics.
[37] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.