PubTator Central: automated concept annotation for biomedical full text articles

PubTator Central (https://www.ncbi.nlm.nih.gov/research/pubtator/) is a web service for viewing and retrieving bioconcept annotations in full text biomedical articles. PubTator Central (PTC) provides automated annotations from state-of-the-art text mining systems for genes/proteins, genetic variants, diseases, chemicals, species and cell lines, all available for immediate download. PTC annotates PubMed (29 million abstracts) and the PMC Text Mining subset (3 million full text articles). The new PTC web interface allows users to build full text document collections and visualize concept annotations in each document. Annotations are downloadable in multiple formats (XML, JSON and tab delimited) via the online interface, a RESTful web service and bulk FTP. Improved concept identification systems and a new disambiguation module based on deep learning increase annotation accuracy, and the new server-side architecture is significantly faster. PTC is synchronized with PubMed and PubMed Central, with new articles added daily. The original PubTator service has served annotated abstracts for ∼300 million requests, enabling third-party research in use cases such as biocuration support, gene prioritization, genetic disease analysis, and literature-based knowledge discovery. We demonstrate that the full text results in PTC significantly increase biomedical concept coverage, and we anticipate that this expansion will both enhance existing downstream applications and enable new use cases.
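As an illustration of retrieval through the RESTful web service, the following minimal sketch builds an export request URL for a set of PubMed identifiers in one of the three download formats. The base path, the format names (`biocxml`, `biocjson`, `pubtator`) and the `pmids` and `concepts` query parameters are assumptions and should be checked against the current PTC API documentation; `build_export_url` is a hypothetical helper, and the PMIDs shown are placeholders. The sketch only constructs the URL and performs no network request.

```python
from urllib.parse import urlencode

# ASSUMPTION: base path of the export endpoint; verify against the PTC API docs.
API_BASE = "https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export"

# ASSUMPTION: endpoint names for the XML, JSON and tab-delimited formats.
FORMATS = {"xml": "biocxml", "json": "biocjson", "tab": "pubtator"}


def build_export_url(pmids, fmt="json", concepts=None):
    """Return an export URL for the given PMIDs (hypothetical helper).

    pmids    -- iterable of PubMed identifiers (ints or strings)
    fmt      -- one of "xml", "json", "tab"
    concepts -- optional iterable of concept types to restrict the output,
                e.g. ["gene", "disease"] (parameter name is an assumption)
    """
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    ids = [str(p) for p in pmids]
    if not ids:
        raise ValueError("at least one PMID is required")

    params = {"pmids": ",".join(ids)}
    if concepts:
        params["concepts"] = ",".join(concepts)

    # Keep commas unescaped so the identifier list stays readable.
    return f"{API_BASE}/{FORMATS[fmt]}?{urlencode(params, safe=',')}"


if __name__ == "__main__":
    # Placeholder PMIDs, for illustration only.
    print(build_export_url([12345678, 23456789], fmt="json"))
```

The resulting URL could then be passed to any HTTP client. Keeping URL construction separate from the request makes the format and parameter assumptions easy to adjust if the service's interface differs.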
