Paper information - How Do Your Biomedical Named Entity Models Generalize to Novel Entities?

The volume of biomedical literature on new biomedical concepts is growing rapidly, which calls for reliable biomedical named entity recognition (BioNER) models that can identify new and unseen entity mentions. However, it is questionable whether existing BioNER models can handle them effectively. In this work, we systematically analyze three types of recognition abilities of BioNER models: memorization, synonym generalization, and concept generalization. We find that although BioNER models achieve state-of-the-art overall performance on BioNER benchmarks, they are limited in identifying synonyms and new biomedical concepts such as COVID-19. From this observation, we conclude that the generalization abilities of existing BioNER models are overestimated. We also identify several difficulties in recognizing unseen mentions in BioNER and draw the following conclusions: (1) BioNER models tend to exploit dataset biases, which hinders their ability to generalize, and (2) several biomedical names, such as COVID-19, have novel morphological patterns with little name regularity, and models fail to recognize them. As a simple remedy, we apply a current statistics-based debiasing method to our problem and show that it improves generalization to unseen mentions. We hope that our analyses and findings will facilitate further research into the generalization capabilities of NER models in a domain where their reliability is of utmost importance.
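The abstract names three recognition abilities but does not spell out how test mentions are assigned to each. The sketch below shows one plausible way to operationalize the split, assuming every annotated mention carries a surface string and a concept identifier: a test mention counts as memorization if its surface form occurs in the training data, as synonym generalization if only its concept occurs there, and as concept generalization if neither does. The function names, the case-insensitive matching, and the concept IDs (`C1`, `C2`, `C3`) are illustrative assumptions, not details taken from the paper.

```python
def split_mentions(train, test):
    """Assign each (surface, concept_id) test mention to 'mem', 'syn', or 'con'.

    Assumed criteria (hypothetical, for illustration only):
      mem: surface form was seen in training (case-insensitive)
      syn: surface form is new, but its concept was seen in training
      con: both the surface form and the concept are new
    """
    seen_surfaces = {surface.lower() for surface, _ in train}
    seen_concepts = {concept_id for _, concept_id in train}
    splits = {"mem": [], "syn": [], "con": []}
    for surface, concept_id in test:
        if surface.lower() in seen_surfaces:
            splits["mem"].append((surface, concept_id))
        elif concept_id in seen_concepts:
            splits["syn"].append((surface, concept_id))
        else:
            splits["con"].append((surface, concept_id))
    return splits


def recall_per_split(splits, predicted_surfaces):
    """Recall on each split; None when a split has no mentions."""
    recall = {}
    for name, mentions in splits.items():
        if not mentions:
            recall[name] = None
            continue
        hits = sum(1 for surface, _ in mentions if surface in predicted_surfaces)
        recall[name] = hits / len(mentions)
    return recall


# Toy data with made-up concept IDs.
train = [("coronavirus", "C1"), ("heart attack", "C2")]
test = [
    ("Coronavirus", "C1"),            # surface form seen in training
    ("myocardial infarction", "C2"),  # new name for a seen concept
    ("COVID-19", "C3"),               # new name and new concept
]

splits = split_mentions(train, test)

# Hypothetical model output that misses the novel concept.
predicted = {"Coronavirus", "myocardial infarction"}
recall = recall_per_split(splits, predicted)
print(recall)
```

Reporting recall per split rather than a single overall score is what makes a gap between memorized and unseen mentions visible; an aggregate metric dominated by the memorization split can hide weak performance on the other two.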

Jaewoo Kang | Hyunjae Kim