Revised JNLPBA Corpus: A Revised Version of Biomedical NER Corpus for Relation Extraction Task

The advancement of biomedical named entity recognition (BNER) and biomedical relation extraction (BRE) researches promotes the development of text mining in biological domains. As a cornerstone of BRE, robust BNER system is required to identify the mentioned NEs in plain texts for further relation extraction stage. However, the current BNER corpora, which play important roles in these tasks, paid less attention to achieve the criteria for BRE task. In this study, we present Revised JNLPBA corpus, the revision of JNLPBA corpus, to broaden the applicability of a NER corpus from BNER to BRE task. We preserve the original entity types including protein, DNA, RNA, cell line and cell type while all the abstracts in JNLPBA corpus are manually curated by domain experts again basis on the new annotation guideline focusing on the specific NEs instead of general terms. Simultaneously, several imperfection issues in JNLPBA are pointed out and made up in the new corpus. To compare the adaptability of different NER systems in Revised JNLPBA and JNLPBA corpora, the F1-measure was measured in three open sources NER systems including BANNER, Gimli and NERSuite. In the same circumstance, all the systems perform average 10% better in Revised JNLPBA than in JNLPBA. Moreover, the cross-validation test is carried out which we train the NER systems on JNLPBA/Revised JNLPBA corpora and access the performance in both protein-protein interaction extraction (PPIE) and biomedical event extraction (BEE) corpora to confirm that the newly refined Revised JNLPBA is a competent NER corpus in biomedical relation application. The revised JNLPBA corpus is freely available at iasl-btm.iis.sinica.edu.tw/BNER/Content/Revised_JNLPBA.zip.

[1]  David Yarowsky,et al.  Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs , 1992, ACL.

[2]  Elizabeth M. Smigielski,et al.  dbSNP: a database of single nucleotide polymorphisms , 2000, Nucleic Acids Res..

[3]  R. J. Roberts PubMed Central: The GenBank of the published literature. , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[4]  Jun'ichi Tsujii,et al.  Tuning support vector machines for biomedical named entity recognition , 2002, ACL Workshop on Natural Language Processing in the Biomedical Domain.

[5]  Daniel Berleant,et al.  Mining MEDLINE: Abstracts, Sentences, or Phrases? , 2001, Pacific Symposium on Biocomputing.

[6]  Mathew W. Wright,et al.  Guidelines for human gene nomenclature. , 2002, Genomics.

[7]  Jun'ichi Tsujii,et al.  GENIA corpus - a semantically annotated corpus for bio-textmining , 2003, ISMB.

[8]  Jian Su,et al.  Recognizing Names in Biomedical Texts: a Machine Learning Approach , 2004 .

[9]  Nigel Collier,et al.  Introduction to the Bio-entity Recognition Task at JNLPBA , 2004, NLPBA/BioNLP.

[10]  Rohit J. Kate,et al.  Comparative experiments on learning information extractors for proteins and their interactions , 2005, Artif. Intell. Medicine.

[11]  Lorraine K. Tanabe,et al.  GENETAG: a tagged corpus for gene/protein named entity recognition , 2005, BMC Bioinformatics.

[12]  Claire Nédellec,et al.  Learning Language in Logic - Genic Interaction Extraction Challenge , 2005 .

[13]  E. Loper,et al.  NLTK: The Natural Language Toolkit , 2006, ACL 2006.

[14]  Jari Björne,et al.  BioInfer: a corpus for information extraction in the biomedical domain , 2007, BMC Bioinformatics.

[15]  Ralf Zimmer,et al.  RelEx - Relation extraction using dependency parse trees , 2007, Bioinform..

[16]  K. Cohen,et al.  Overview of BioCreative II gene normalization , 2008, Genome Biology.

[17]  Graciela Gonzalez,et al.  BANNER: An Executable Survey of Advances in Biomedical Named Entity Recognition , 2007, Pacific Symposium on Biocomputing.

[18]  Hongfei Lin,et al.  Exploiting the performance of dictionary-based bio-entity name recognition in biomedical literature , 2008, Comput. Biol. Chem..

[19]  Shuying Shen,et al.  Developing a manually annotated clinical document corpus to identify phenotypic information for inflammatory bowel disease , 2009, BMC Bioinformatics.

[20]  Jimmy J. Lin,et al.  Evaluation of NLP Systems , 2010 .

[21]  A Valencia,et al.  An Overview of BioCreative II.5 , 2010, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[22]  Alan R. Aronson,et al.  An overview of MetaMap: historical perspective and recent advances , 2010, J. Am. Medical Informatics Assoc..

[23]  Peder Olesen Larsen,et al.  The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index , 2010, Scientometrics.

[24]  Zhiyong Lu,et al.  The gene normalization task in BioCreative III , 2011, BMC Bioinformatics.

[25]  Laura Inés Furlong,et al.  Challenges in the association of human single nucleotide polymorphism mentions with unique database identifiers , 2011, BMC Bioinformatics.

[26]  Ulf Leser,et al.  ChemSpot: a hybrid system for chemical named entity recognition , 2012, Bioinform..

[27]  W. Marsden I and J , 2012 .

[28]  José Luís Oliveira,et al.  Gimli: open source and high-performance biomedical name recognition , 2013, BMC Bioinformatics.

[29]  Zhiyong Lu,et al.  DNorm: disease name normalization with pairwise learning to rank , 2013, Bioinform..

[30]  Paloma Martínez,et al.  The DDI corpus: An annotated corpus with pharmacological substances and drug-drug interactions , 2013, J. Biomed. Informatics.

[31]  Zhiyong Lu,et al.  PubTator: a web-based text mining tool for assisting biocuration , 2013, Nucleic Acids Res..

[32]  Jung-Hsien Chiang,et al.  Overview of the gene ontology task at BioCreative IV , 2014, Database J. Biol. Databases Curation.

[33]  Zhiyong Lu,et al.  NCBI disease corpus: A resource for disease name recognition and concept normalization , 2014, J. Biomed. Informatics.

[34]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[35]  Özlem Uzuner,et al.  Creation of a new longitudinal corpus of clinical narratives , 2015, J. Biomed. Informatics.

[36]  Lutz Bornmann,et al.  Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references , 2014, J. Assoc. Inf. Sci. Technol..

[37]  Douglas A. Reynolds,et al.  Deep Neural Network Approaches to Speaker and Language Recognition , 2015, IEEE Signal Processing Letters.

[38]  Zhiyong Lu,et al.  TaggerOne: joint named entity recognition and normalization with semi-Markov Models , 2016, Bioinform..

[39]  Sampo Pyysalo,et al.  How to Train good Word Embeddings for Biomedical NLP , 2016, BioNLP@ACL.

[40]  Tao Chen,et al.  Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks , 2016, Database J. Biol. Databases Curation.

[41]  David Sontag,et al.  Learning a Health Knowledge Graph from Electronic Medical Records , 2017, Scientific Reports.

[42]  Yue Zhang,et al.  A transition‐based joint model for disease named entity recognition and normalization , 2017, Bioinform..

[43]  Xiaolin Li,et al.  GRAM-CNN: a deep learning approach with local context for named entity recognition in biomedical text , 2017, Bioinform..

[44]  Parth Pathak,et al.  Annotation of a Large Clinical Entity Corpus , 2018, EMNLP.

[45]  Xinguang Li,et al.  Evaluation of Scholar's Contribution to Team Based on Weighted Co-author Network , 2019, ICPCSEE.

[46]  Tudor Groza,et al.  Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources , 2018, Nucleic Acids Res..

[47]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[48]  Said Ouatik El Alaoui,et al.  SemBioNLQA: A semantic biomedical question answering system for retrieving exact and ideal answers to natural language questions , 2020, Artif. Intell. Medicine.