CogLTX: Applying BERT to Long Texts

BERT is unable to process long texts because its memory and time consumption grow quadratically with sequence length. The most natural workarounds, such as slicing the text with a sliding window or simplifying the transformer, either suffer from insufficient long-range attention or require customized CUDA kernels. The maximum length limit in BERT is reminiscent of the limited capacity (5∼9 chunks) of human working memory; how, then, do human beings Cognize Long TeXts? Founded on the cognitive theory stemming from Baddeley [2], the proposed CogLTX framework identifies key sentences by training a judge model, concatenates them for reasoning, and enables multi-step reasoning via rehearsal and decay. Since relevance annotations are usually unavailable, we propose to create supervision through interventions. As a general algorithm, CogLTX outperforms or matches SOTA models on various downstream tasks, with memory overhead independent of the text length.
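
The judge-then-reason loop described above can be illustrated with a minimal sketch. The block splitter, the word-overlap judge_score, and the MAX_TOKENS budget below are hypothetical stand-ins introduced only for illustration; in CogLTX the judge and the reasoner are trained BERT-scale models and the judge's supervision is created by intervention, none of which is reproduced here.

    # Minimal, hypothetical sketch of a CogLTX-style selection loop.
    # The real judge/reasoner are BERT models; here a toy overlap score
    # stands in so the control flow (rehearsal and decay) is runnable.
    from typing import List

    MAX_TOKENS = 512  # assumed BERT-style length budget for the reasoner input


    def split_into_blocks(text: str, block_size: int = 60) -> List[str]:
        """Split a long text into short, roughly sentence-sized blocks."""
        words = text.split()
        return [" ".join(words[i:i + block_size])
                for i in range(0, len(words), block_size)]


    def judge_score(query: str, block: str) -> float:
        """Toy relevance judge (word overlap); a trained BERT judge in CogLTX."""
        q, b = set(query.lower().split()), set(block.lower().split())
        return len(q & b) / (len(q) + 1e-6)


    def cogltx_select(query: str, text: str, steps: int = 2) -> str:
        """Multi-step key-block selection with rehearsal (keep) and decay (drop)."""
        blocks = split_into_blocks(text)
        memory: List[str] = []  # working-memory analogue: retained key blocks
        for _ in range(steps):
            # Score unseen blocks conditioned on the query plus current memory.
            context = query + " " + " ".join(memory)
            candidates = sorted(
                (b for b in blocks if b not in memory),
                key=lambda b: judge_score(context, b),
                reverse=True,
            )
            # Rehearsal: add the best candidates while the concatenation fits.
            for b in candidates:
                if len(" ".join(memory + [b, query]).split()) < MAX_TOKENS:
                    memory.append(b)
            # Decay: re-judge retained blocks and drop the irrelevant ones.
            memory = [b for b in memory if judge_score(query, b) > 0.0]
        return " ".join(memory)  # passed to the reasoner (e.g. BERT) with the query

For a question-answering instance, cogltx_select(question, document) would return a short concatenation of key blocks that fits the reasoner's length limit, which is why the memory cost of the reasoning step does not depend on the document length.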

[1] Lei Li, et al. Dynamically Fused Graph Network for Multi-hop Reasoning, 2019, ACL.

[2] Yuan Luo, et al. Graph Convolutional Networks for Text Classification, 2018, AAAI.

[3] Arman Cohan, et al. Longformer: The Long-Document Transformer, 2020, ArXiv.

[4] Yoon Kim, et al. Convolutional Neural Networks for Sentence Classification, 2014, EMNLP.

[5] Shuohang Wang, et al. Machine Comprehension Using Match-LSTM and Answer Pointer, 2016, ICLR.

[6] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.

[7] Wei Wang, et al. Multi-Granularity Hierarchical Attention Fusion Networks for Reading Comprehension and Question Answering, 2018, ACL.

[8] Dirk Weissenborn, et al. Making Neural QA as Simple as Possible but not Simpler, 2017, CoNLL.

[9] Tomas Mikolov, et al. Bag of Tricks for Efficient Text Classification, 2016, EACL.

[10] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[11] Najim Dehak, et al. Joint Verification-Identification in End-to-End Multi-Scale CNN Framework for Topic Identification, 2018, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] M. D'Esposito. Working Memory, 2008, Handbook of Clinical Neurology.

[13] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.

[14] Philip Bachman, et al. NewsQA: A Machine Comprehension Dataset, 2016, Rep4NLP@ACL.

[15] Ken Lang, et al. NewsWeeder: Learning to Filter Netnews, 1995, ICML.

[16] Hwee Tou Ng, et al. A Question-Focused Multi-Factor Attention Network for Question Answering, 2018, AAAI.

[17] Gerard Salton, et al. A Vector Space Model for Automatic Indexing, 1975, CACM.

[18] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[19] Chang Zhou, et al. Cognitive Graph for Multi-Hop Reading Comprehension at Scale, 2019, ACL.

[20] Ramesh Nallapati, et al. Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering, 2019, EMNLP.

[21] O. Wilhelm, et al. Working Memory Capacity: Facets of a Cognitive Ability Construct, 2000.

[22] Ming-Wei Chang, et al. Latent Retrieval for Weakly Supervised Open Domain Question Answering, 2019, ACL.

[23] Shakir Mohamed, et al. Variational Inference with Normalizing Flows, 2015, ICML.

[25] D. Rubin, et al. Maximum Likelihood from Incomplete Data via the EM Algorithm (with Discussion), 1977.

[26] Hannaneh Hajishirzi, et al. Multi-hop Reading Comprehension through Question Decomposition and Rescoring, 2019, ACL.

[27] Jian Su, et al. Densely Connected Attention Propagation for Reading Comprehension, 2018, NeurIPS.

[28] Richard Socher, et al. Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering, 2019, ICLR.

[29] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.

[30] Yoshua Bengio, et al. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering, 2018, EMNLP.

[31] Koichi Kise, et al. An Experimental Study of Self-supervised Learning for Reading Behavior Recognition, 2019.

[32] Youngja Park, et al. Unsupervised Sentence Embedding Using Document Structure-Based Context, 2019, ECML/PKDD.

[33] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.

[34] John Brown. Some Tests of the Decay Theory of Immediate Memory, 1958.

[35] Eunsol Choi, et al. MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension, 2019, MRQA@EMNLP.

[36] Xiao Liu, et al. Self-supervised Learning: Generative or Contrastive, 2020, ArXiv.

[37] Rajarshi Das, et al. Multi-step Retriever-Reader Interaction for Scalable Open-domain Question Answering, 2019, ICLR.

[38] Zhe Gan, et al. Hierarchical Graph Network for Multi-hop Question Answering, 2019, EMNLP.

[39] Ali Farhadi, et al. Bidirectional Attention Flow for Machine Comprehension, 2016, ICLR.

[40] T. E. Lange, et al. Below the Surface: Analogical Similarity and Retrieval Competition in Reminding, 1994, Cognitive Psychology.

[41] P. Carpenter, et al. Individual Differences in Working Memory and Reading, 1980.

[42] Edouard Grave, et al. Adaptive Attention Span in Transformers, 2019, ACL.

[43] Masaaki Nagata, et al. Answering while Summarizing: Multi-task Learning for Multi-hop QA with Evidence Extraction, 2019, ACL.

[44] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[45] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[46] Jason Weston, et al. Reading Wikipedia to Answer Open-Domain Questions, 2017, ACL.

[47] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.

[48] G. A. Miller. The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information, 1956, Psychological Review.

[49] Timothy P. Lillicrap, et al. Compressive Transformers for Long-Range Sequence Modelling, 2019, ICLR.

[50] Michael Collins, et al. Discriminative Reranking for Natural Language Parsing, 2000, CL.

[51] P. Barrouillet, et al. Time Constraints and Resource Sharing in Adults' Working Memory Spans, 2004, Journal of Experimental Psychology: General.

[52] Lukasz Kaiser, et al. Reformer: The Efficient Transformer, 2020, ICLR.

[53] Omer Levy, et al. Blockwise Self-Attention for Long Document Understanding, 2020, EMNLP.

[54] Jesús Villalba, et al. Hierarchical Transformers for Long Document Classification, 2019, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[55] Richard Socher, et al. Efficient and Robust Question Answering from Minimal Context over Documents, 2018, ACL.