Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

Transformer has become ubiquitous in the deep learning field. One of the key ingredients behind its success is the self-attention mechanism, which allows fully-connected contextual encoding over input tokens. However, despite its effectiveness in modeling short sequences, self-attention struggles with inputs that exhibit extremely long-range dependencies, as its complexity grows quadratically with the sequence length. Long sequences are therefore often encoded by Transformer in chunks using a sliding window. In this paper, we propose Cluster-Former, a novel clustering-based sparse Transformer that performs attention across chunked sequences. Our method allows information integration beyond local windows, which is especially beneficial for question answering (QA) and language modeling tasks that rely on long-range dependencies. Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
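
The core idea, restricting attention to clusters of hidden states drawn from all chunks rather than to the full sequence, can be illustrated with a short sketch. This is a minimal illustration and not the authors' implementation: the plain k-means routine, the cluster count, the tensor shapes, and the use of torch.nn.MultiheadAttention are all assumptions made for the example.

```python
# Minimal sketch of clustering-based sparse attention (assumptions, not the paper's code):
# hidden states concatenated across chunks are grouped by k-means, and full
# self-attention is applied within each cluster, so tokens that are far apart
# in the original sequence can still attend to one another.
import torch
from torch import nn


def cluster_attention(hidden, attn, num_clusters=8, iters=5):
    """hidden: (seq_len, dim) states concatenated across all chunks."""
    seq_len, dim = hidden.shape

    # Plain k-means on the hidden states (illustrative clustering step).
    centroids = hidden[torch.randperm(seq_len)[:num_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(hidden, centroids).argmin(dim=1)  # (seq_len,)
        for c in range(num_clusters):
            members = hidden[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)

    # Self-attention restricted to each cluster.
    out = hidden.clone()
    for c in range(num_clusters):
        idx = (assign == c).nonzero(as_tuple=True)[0]
        if len(idx) == 0:
            continue
        x = hidden[idx].unsqueeze(1)   # (cluster_len, 1, dim) for MultiheadAttention
        y, _ = attn(x, x, x)           # full attention inside the cluster only
        out[idx] = y.squeeze(1)
    return out


# Usage example (hypothetical sizes): 2048 tokens, 256-dim states, 4 heads.
hidden = torch.randn(2048, 256)
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4)
out = cluster_attention(hidden, attn)
print(out.shape)  # torch.Size([2048, 256])
```

The point of the sketch is the complexity argument: each cluster of average size n/k costs O((n/k)^2) attention, so the total cost is roughly O(n^2 / k) instead of O(n^2) for full self-attention, while still connecting tokens across distant chunks.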
