LinkBERT: Pretraining Language Models with Document Links

Language model (LM) pretraining captures various kinds of knowledge from text corpora, benefiting downstream tasks. However, existing methods such as BERT model each document in isolation and do not capture dependencies or knowledge that span documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on a range of downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets a new state of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as the code and data.
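To make the input-construction idea concrete, the sketch below shows one way the two objectives described above could be set up: an anchor segment is paired with either its contiguous continuation, a segment from a linked document, or a segment from a random document, and the pair is labeled for document relation prediction while masked language modeling is applied on top. This is a minimal illustration based only on the description in the abstract; the helper names, the uniform sampling split, and the tokenizer interface are assumptions, not the released implementation.

```python
import random

# A minimal sketch of LinkBERT-style pretraining input construction, based
# only on the abstract above. Helper names, the uniform 3-way sampling split,
# and the tokenizer interface are illustrative assumptions, not the authors'
# released implementation.

DRP_LABELS = {"contiguous": 0, "random": 1, "linked": 2}

def make_pretraining_example(corpus, doc_graph, doc_id, seg_idx, tokenizer):
    """Pair an anchor segment with a second segment plus a DRP label.

    corpus:    dict mapping doc_id -> list of text segments
    doc_graph: dict mapping doc_id -> list of linked doc_ids (hyperlinks
               for Wikipedia, citation links for PubMed)
    """
    anchor = corpus[doc_id][seg_idx]
    relation = random.choice(list(DRP_LABELS))

    if relation == "contiguous" and seg_idx + 1 < len(corpus[doc_id]):
        segment_b = corpus[doc_id][seg_idx + 1]       # next segment, same doc
    elif relation == "linked" and doc_graph.get(doc_id):
        linked_id = random.choice(doc_graph[doc_id])  # follow a document link
        segment_b = random.choice(corpus[linked_id])
    else:
        relation = "random"                           # fall back to a random doc
        other_id = random.choice([d for d in corpus if d != doc_id])
        segment_b = random.choice(corpus[other_id])

    # Both objectives are trained jointly on the resulting pair: masked
    # language modeling corrupts tokens in the concatenated input, while
    # document relation prediction classifies how segment B relates to the
    # anchor (contiguous, random, or linked).
    encoding = tokenizer(anchor, segment_b, truncation=True)
    return encoding, DRP_LABELS[relation]
```

The key design choice this sketch highlights is that placing a linked segment in the same context window exposes the LM to knowledge that spans documents, which a single-document sampler like BERT's never sees.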
