Ankit Singh Rawat | Sashank J. Reddi | Sanjiv Kumar | Srinadh Bhojanapalli | Chulhee Yun
[1] Yang Liu, et al. On Identifiability in Transformers, 2020, ICLR.
[2] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, IEEE International Conference on Computer Vision (ICCV).
[3] Fedor Moiseev, et al. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned, 2019, ACL.
[4] Omer Levy, et al. Are Sixteen Heads Really Better than One?, 2019, NeurIPS.
[5] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[6] Samy Bengio, et al. Tensor2Tensor for Neural Machine Translation, 2018, AMTA.
[7] Ankit Singh Rawat, et al. Are Transformers universal approximators of sequence-to-sequence functions?, 2020, ICLR.
[8] Ilya Sutskever, et al. Generating Long Sequences with Sparse Transformers, 2019, arXiv.
[9] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.
[10] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.
[11] Yann Dauphin, et al. Pay Less Attention with Lightweight and Dynamic Convolutions, 2019, ICLR.
[12] Myle Ott, et al. Scaling Neural Machine Translation, 2018, WMT.
[13] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.
[14] 知秀 柴田 (Tomohide Shibata). Understand It in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2020.
[15] Lukasz Kaiser, et al. Generating Wikipedia by Summarizing Long Sequences, 2018, ICLR.
[16] Tara N. Sainath, et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models, 2018, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[17] Ruslan Salakhutdinov, et al. Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, 2017, ICLR.
[18] Yoshua Bengio, et al. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches, 2014, SSST@EMNLP.
[19] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[20] Thorsten Brants, et al. One billion word benchmark for measuring progress in statistical language modeling, 2013, INTERSPEECH.
[21] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.
[22] Razvan Pascanu, et al. Relational Deep Reinforcement Learning, 2018, arXiv.
[23] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.
[24] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[25] Abhinav Gupta, et al. Non-local Neural Networks, 2018, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[26] André F. T. Martins, et al. Adaptively Sparse Transformers, 2019, EMNLP.
[27] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[28] Xu Tan, et al. Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion, 2019, INTERSPEECH.
[29] Han Zhang, et al. Self-Attention Generative Adversarial Networks, 2018, ICML.
[30] Yuxi Li, et al. Deep Reinforcement Learning: An Overview, 2017, arXiv.
[31] Yann Dauphin, et al. Convolutional Sequence to Sequence Learning, 2017, ICML.
[32] James Demmel, et al. Reducing BERT Pre-Training Time from 3 Days to 76 Minutes, 2019, arXiv.