Coreference Resolution for the Biomedical Domain: A Survey

Issues with coreference resolution are among the most frequently mentioned challenges for information extraction from the biomedical literature. Thus, the biomedical genre has long been the second most researched genre for coreference resolution after the news domain, and the subject of a great deal of research for NLP in general. In recent years this interest has grown enormously, leading to the development of a number of substantial datasets, of domain-specific contextual language models, and of several architectures. In this paper we review the state of the art of coreference resolution in the biomedical domain, with particular attention to these most recent developments.
