Roy Schwartz | Noah A. Smith | Nikolaos Pappas | Dani Yogatama | Lingpeng Kong | Hao Peng
[1] Aurko Roy, et al. Efficient Content-Based Sparse Attention with Routing Transformers, 2021, TACL.
[2] Yann Dauphin, et al. Pay Less Attention with Lightweight and Dynamic Convolutions, 2019, ICLR.
[3] Arman Cohan, et al. Longformer: The Long-Document Transformer, 2020, ArXiv.
[4] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[5] Ilya Sutskever, et al. Generating Long Sequences with Sparse Transformers, 2019, ArXiv.
[6] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[7] Mona Attariyan, et al. Parameter-Efficient Transfer Learning for NLP, 2019, ICML.
[8] Luke S. Zettlemoyer, et al. Transformers with convolutional context for ASR, 2019, ArXiv.
[9] Ankit Singh Rawat, et al. Sampled Softmax with Random Fourier Features, 2019, NeurIPS.
[10] Barnabás Póczos, et al. Fast Function to Function Regression, 2014, AISTATS.
[11] Wenhu Chen, et al. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting, 2019, NeurIPS.
[12] David P. Woodruff, et al. Faster Kernel Ridge Regression Using Sketching and Preconditioning, 2016, SIAM J. Matrix Anal. Appl.
[13] J. Schmidhuber. Reducing the Ratio Between Learning Complexity and Number of Time Varying Variables in Fully Recurrent Nets, 1993.
[14] Marc'Aurelio Ranzato, et al. Classical Structured Prediction Losses for Sequence to Sequence Learning, 2017, NAACL.
[15] Sanjiv Kumar, et al. Orthogonal Random Features, 2016, NIPS.
[16] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[17] Noah A. Smith, et al. Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation, 2020, ArXiv.
[18] Lawrence K. Saul, et al. Kernel Methods for Deep Learning, 2009, NIPS.
[19] Hermann Ney, et al. Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification, 2019, IWSLT.
[20] Yee Whye Teh, et al. Set Transformer, 2018, ICML.
[21] James Henderson, et al. Document-Level Neural Machine Translation with Hierarchical Attention Networks, 2018, EMNLP.
[22] Philipp Koehn, et al. Findings of the 2014 Workshop on Statistical Machine Translation, 2014, WMT@ACL.
[23] Sepp Hochreiter, et al. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), 2015, ICLR.
[24] T. Teichmann, et al. Harmonic Analysis and the Theory of Probability, 1957, The Mathematical Gazette.
[25] Alexei Baevski, et al. Adaptive Input Representations for Neural Language Modeling, 2018, ICLR.
[26] Thomas Wolf, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019, ArXiv.
[27] Omer Levy, et al. Blockwise Self-Attention for Long Document Understanding, 2020, EMNLP.
[28] Edouard Grave, et al. Adaptive Attention Span in Transformers, 2019, ACL.
[29] Han Fang, et al. Linformer: Self-Attention with Linear Complexity, 2020, ArXiv.
[30] Vikas Sindhwani, et al. Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels, 2014, J. Mach. Learn. Res.
[31] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.
[32] Ronald J. Williams, et al. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks, 1989, Neural Computation.
[33] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network, 2015, ArXiv.
[34] Myle Ott, et al. Scaling Neural Machine Translation, 2018, WMT.
[35] Marcello Federico, et al. Report on the 10th IWSLT evaluation campaign, 2013, IWSLT.
[36] Samuel R. Bowman, et al. ListOps: A Diagnostic Dataset for Latent Tree Learning, 2018, NAACL.
[37] Michael Rabadi, et al. Kernel Methods for Machine Learning, 2015.
[38] Lukasz Kaiser, et al. Reformer: The Efficient Transformer, 2020, ICLR.
[39] Xing Wang, et al. Modeling Recurrence for Transformer, 2019, NAACL.
[40] Timothy P. Lillicrap, et al. Compressive Transformers for Long-Range Sequence Modelling, 2019, ICLR.
[41] Tim Salimans, et al. Axial Attention in Multidimensional Transformers, 2019, ArXiv.
[42] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.
[43] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[44] Li Yang, et al. Big Bird: Transformers for Longer Sequences, 2020, NeurIPS.
[45] Li Yang, et al. ETC: Encoding Long and Structured Inputs in Transformers, 2020, EMNLP.
[46] Lukasz Kaiser, et al. Generating Wikipedia by Summarizing Long Sequences, 2018, ICLR.
[47] Matt Post, et al. A Call for Clarity in Reporting BLEU Scores, 2018, WMT.
[48] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[49] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.
[50] Lukasz Kaiser, et al. Universal Transformers, 2018, ICLR.
[51] M. Rudelson, et al. Random Features Methods in Supervised Learning, 2019.
[52] Geoffrey E. Hinton, et al. Using Fast Weights to Attend to the Recent Past, 2016, NIPS.
[53] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[54] Zhijian Liu, et al. Lite Transformer with Long-Short Range Attention, 2020, ICLR.
[55] Richard Socher, et al. Regularizing and Optimizing LSTM Language Models, 2017, ICLR.
[56] Yi Tay, et al. Synthesizer: Rethinking Self-Attention for Transformer Models, 2020, ICML.
[57] Michael W. Mahoney, et al. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT, 2019, AAAI.
[58] Makoto Yamada, et al. Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel, 2019, EMNLP/IJCNLP.
[59] Kenneth O. Stanley, et al. Differentiable plasticity: training plastic neural networks with backpropagation, 2018, ICML.
[60] Benjamin Recht, et al. Random Features for Large-Scale Kernel Machines, 2007, NIPS.
[61] Razvan Pascanu, et al. Stabilizing Transformers for Reinforcement Learning, 2019, ICML.
[62] Mohit Iyyer, et al. Hard-Coded Gaussian Attention for Neural Machine Translation, 2020, ACL.
[63] Dragomir R. Radev, et al. The ACL Anthology Network, 2009.
[64] Masao Utiyama, et al. Recurrent Positional Embedding for Neural Machine Translation, 2019, EMNLP/IJCNLP.
[65] Nikolaos Pappas, et al. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, 2020, ICML.
[66] Yi Tay, et al. Efficient Transformers: A Survey, 2020, ArXiv.
[67] Jürgen Schmidhuber, et al. Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks, 1992, Neural Computation.
[68] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.
[69] Richard Socher, et al. Pointer Sentinel Mixture Models, 2016, ICLR.
[70] Klaus-Robert Müller, et al. An Empirical Study on The Properties of Random Bases for Kernel Methods, 2017, NIPS.
[71] Christopher Potts, et al. Learning Word Vectors for Sentiment Analysis, 2011, ACL.
[72] Lukasz Kaiser, et al. Rethinking Attention with Performers, 2020, ArXiv.
[73] Roy Schwartz, et al. Rational Recurrences, 2018, EMNLP.
[74] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.
[75] Liu Yang, et al. Long Range Arena: A Benchmark for Efficient Transformers, 2020, ICLR.
[76] Liu Yang, et al. Sparse Sinkhorn Attention, 2020, ICML.