Partial Annotation Learning for Biomedical Entity Recognition

Motivation: Named Entity Recognition (NER) is a key task in support of biomedical research. In Biomedical Named Entity Recognition (BioNER), obtaining high-quality expert-annotated data is laborious and expensive, which has led to the development of automatic approaches such as distant supervision. However, both manually and automatically generated data often suffer from the unlabeled entity problem, whereby many entity annotations are missing, degrading the performance of full-annotation NER models.

Results: To address this problem, we systematically study the effectiveness of partial annotation learning methods for biomedical entity recognition under different simulated scenarios of missing entity annotations. Furthermore, we propose a partial annotation learning model, TS-PubMedBERT-Partial-CRF. We harmonize 15 biomedical NER corpora encompassing five entity types to serve as a gold standard, and compare our model against two commonly used partial annotation learning models, BiLSTM-Partial-CRF and EER-PubMedBERT, and against the state-of-the-art full-annotation BioNER model, the PubMedBERT tagger. The results show that partial annotation learning methods can effectively learn from biomedical corpora with missing entity annotations. Our proposed model outperforms the alternatives and, in particular, exceeds the PubMedBERT tagger by 38% in F1-score under high missing entity rates. The recall of entity mentions achieved by our model is also competitive with the upper bound obtained on the fully annotated dataset.
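A simulated missing-annotation scenario of the kind described above might be produced along the following lines. This is a minimal sketch and not the procedure used in the paper: the function name `drop_entities`, the BIO tagging scheme, the choice to remove each entity independently with a fixed probability, and the `unknown` placeholder label are all assumptions made for illustration.

```python
import random


def drop_entities(tags, missing_rate, rng, unknown="O"):
    """Return a copy of a BIO tag sequence with some entity spans removed.

    Hypothetical helper: each entity span (a B- tag followed by its I- tags)
    is removed independently with probability `missing_rate`. Removed tokens
    are relabeled with `unknown`: "O" mimics what a full-annotation tagger
    would see, while a distinct symbol such as "?" could mark tokens whose
    label is left unconstrained for a partial-annotation model.
    """
    out = list(tags)
    i = 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            # Find the end of the current entity span.
            j = i + 1
            while j < len(tags) and tags[j].startswith("I-"):
                j += 1
            # Drop the whole span with the requested probability.
            if rng.random() < missing_rate:
                for k in range(i, j):
                    out[k] = unknown
            i = j
        else:
            i += 1
    return out
```

Calling `drop_entities(tags, 0.9, random.Random(0))` on every training sentence would give a corpus with a high missing entity rate, while the untouched tags remain available as the gold standard for evaluation.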
