Cross-Document Language Modeling

We introduce a new pretraining approach for language models geared to support multi-document NLP tasks. Our cross-document language model (CD-LM) improves masked language modeling for these tasks with two key ideas. First, we pretrain with multiple related documents in a single input, via cross-document masking, which encourages the model to learn cross-document and long-range relationships. Second, extending the recent Longformer model, we pretrain with long contexts of several thousand tokens and introduce a new attention pattern that uses sequence-level global attention to predict masked tokens, while retaining the familiar local attention elsewhere. We show that our CD-LM sets new state-of-the-art results for several multi-text tasks, including cross-document event and entity coreference resolution, paper citation recommendation, and document plagiarism detection, while using a significantly reduced number of training parameters relative to prior work.
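To make the two ideas concrete, below is a minimal sketch of how a cross-document masked-LM training example could be constructed on top of the HuggingFace Longformer implementation. The document separator, the 15% masking rate, and the helper function `build_cross_document_example` are illustrative assumptions, not the paper's exact recipe; the key points it shows are (a) concatenating multiple related documents into one long input, and (b) assigning global attention to the masked positions while all other tokens keep Longformer's local sliding-window attention.

```python
# Sketch of cross-document masked language modeling with Longformer.
# Separator token, masking rate, and helper names are illustrative assumptions.
import torch
from transformers import LongformerTokenizerFast, LongformerForMaskedLM

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")

def build_cross_document_example(documents, mask_prob=0.15):
    """Concatenate related documents into one long sequence and mask tokens,
    so that recovering them may require information from other documents."""
    # Join the related documents; the separator string is a placeholder choice.
    text = " </s> ".join(documents)
    enc = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = input_ids.clone()

    # Randomly mask a fraction of the non-special tokens.
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(
            input_ids[0].tolist(), already_has_special_tokens=True
        ),
        dtype=torch.bool,
    )
    mask = (torch.rand(input_ids.shape) < mask_prob) & ~special.unsqueeze(0)
    input_ids[mask] = tokenizer.mask_token_id
    labels[~mask] = -100  # compute the MLM loss only on masked positions

    # Global attention on masked positions: each masked token attends to the
    # whole multi-document sequence; all other tokens keep local attention.
    global_attention_mask = mask.long()
    return enc["attention_mask"], input_ids, global_attention_mask, labels

docs = [
    "First news report about the event ...",
    "Second report describing the same event ...",
]
attention_mask, input_ids, global_attention_mask, labels = build_cross_document_example(docs)
out = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    global_attention_mask=global_attention_mask,
    labels=labels,
)
loss = out.loss  # masked-token prediction loss over the joint document set
```

In this setup, predicting a masked mention in one document can benefit from an unmasked, coreferent mention in a related document, which is the intuition behind cross-document masking.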
