RRWKV: Capturing Long-range Dependencies in RWKV