Curatable Named-Entity Recognition Using Semantic Relations

Named-entity recognition (NER) plays an important role in the development of biomedical databases. However, existing NER tools produce a wide variety of named entities, only some of which are curatable. Classifying named entities as curatable or non-curatable is therefore a straightforward way to accelerate the biocuration workflow. Co-occurrence Interaction Nexus with Named-entity Recognition (CoINNER) is a web-based tool that allows users to identify gene, chemical, disease, and action term mentions in the Comparative Toxicogenomics Database (CTD). To further discover interactions, CoINNER uses multiple advanced algorithms to recognize the mentions in the BioCreative IV CTD Track. CoINNER extends our previous prototype system, which annotated gene, chemical, and disease mentions in PubMed abstracts at BioCreative 2012 Track I (literature triage). The pre-tagging results of CoINNER were built on the state-of-the-art named-entity recognition tools from BioCreative III. Next, a method based on conditional random fields (CRFs) is proposed to predict chemical and disease mentions in the articles. Finally, action term mentions were collected by latent Dirichlet allocation (LDA). At the BioCreative IV CTD Track, the best F-measures reached for gene/protein, chemical/drug, and disease NER were 54 percent, while CoINNER achieved a 61.5 percent F-measure. System URL: http://ikmbio.csie.ncku.edu.tw/coinner/introduction.htm.
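
For readers unfamiliar with the F-measure reported above, the following is a minimal sketch of how an entity-level F-measure is commonly computed from gold and predicted mentions. It assumes exact matching on (start, end, type) tuples and the balanced F1 variant; the function name `ner_f_measure` and the example spans are hypothetical and are not taken from CoINNER or the BioCreative evaluation scripts, whose matching criteria may differ.

```python
def ner_f_measure(gold, predicted):
    """Balanced F-measure (F1) over exactly matching entity mentions.

    Each mention is a hashable tuple such as (start, end, entity_type).
    Exact-match scoring is an assumption made for this illustration.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical mentions: 6 gold, 4 predicted, 3 of them matching exactly.
gold_mentions = [
    (0, 5, "Gene"), (10, 18, "Chemical"), (25, 33, "Disease"),
    (40, 44, "Gene"), (50, 59, "Chemical"), (62, 70, "Disease"),
]
predicted_mentions = [
    (0, 5, "Gene"), (10, 18, "Chemical"), (25, 33, "Disease"),
    (45, 49, "Gene"),  # spurious prediction with no gold counterpart
]

score = ner_f_measure(gold_mentions, predicted_mentions)
print(f"F-measure: {score:.3f}")
```

With these made-up mentions, precision is 3/4 and recall is 3/6, so the harmonic mean penalizes the lower recall more than a simple average would.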
