RRWKV: Capturing Long-range Dependencies in RWKV

Owing to its impressive dot-product attention, the Transformer has been the dominant architecture in various natural language processing (NLP) tasks. Recently, the Receptance Weighted Key Value (RWKV) model adopted a non-transformer architecture to eliminate the drawback of dot-product attention, whose memory and computational complexity scale quadratically with sequence length. Although RWKV exploits a linear tensor-product attention mechanism and achieves parallelized computation by deploying the time-sequential mode, it fails to capture long-range dependencies because of its limited ability to look back at previous information, compared with the full information obtained through direct pairwise interactions in the standard Transformer. This paper therefore devises the Retrospected Receptance Weighted Key Value (RRWKV) architecture, which incorporates a retrospecting ability into RWKV to absorb information effectively while maintaining memory and computational efficiency.
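
The sketch below illustrates the kind of linear, per-channel attention RWKV computes in its time-sequential mode: a weighted average of past values whose weights decay exponentially with distance, so each step only updates a fixed-size running state instead of attending over the whole sequence. It is a minimal sketch of the published RWKV WKV recurrence, not the retrospection mechanism introduced by RRWKV; the function name `wkv_sequential` and the array shapes are illustrative assumptions, and the numerically stabilized form used in practice is omitted for brevity.

```python
import numpy as np

def wkv_sequential(k, v, w, u):
    """Minimal sketch of the RWKV WKV recurrence (time-sequential mode).

    k, v : (T, C) arrays of keys and values for T tokens and C channels.
    w, u : (C,) learned per-channel decay and current-token bonus.
    Returns the (T, C) attention-like outputs.
    """
    T, C = k.shape
    num = np.zeros(C)          # running weighted sum of past values
    den = np.zeros(C)          # running sum of past weights
    out = np.zeros((T, C))
    for t in range(T):
        cur = np.exp(u + k[t])                        # bonus weight for the current token
        out[t] = (num + cur * v[t]) / (den + cur)     # weighted average over past + current
        num = np.exp(-w) * num + np.exp(k[t]) * v[t]  # decay old information, absorb token t
        den = np.exp(-w) * den + np.exp(k[t])
    return out
```

Because the state (`num`, `den`) has a fixed size, tokens far in the past are reachable only through this exponentially decayed summary, which is exactly the limitation on looking back at previous information that RRWKV's retrospecting ability is designed to relieve.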
