Building a PubMed knowledge graph

PubMed® is an essential resource for the medical domain, but useful concepts are either difficult to extract or ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID®, and identifying fine-grained affiliation data from MapAffil. By integrating these credible multi-source data, we created connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models on F1 score (by 0.51%), and that the author name disambiguation (AND) achieved an F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities.

Measurement(s): textual entity • author information textual entity • funding source declaration textual entity • abstract • Biologic Entity Classification

Technology Type(s): machine learning • computational modeling technique

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.12452597
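As a rough illustration of the connections the abstract describes, the sketch below links articles to authors, bio-entities, and funding in a tiny in-memory graph and then profiles an author by the bio-entities mentioned in their articles. It is a toy model under stated assumptions, not the PKG schema or pipeline: the class `ToyKnowledgeGraph`, the function `author_entity_profile`, the relation names (`written_by`, `mentions`, `funded_by`), and all identifiers are hypothetical.

```python
from collections import defaultdict


class ToyKnowledgeGraph:
    """Minimal triple store. Hypothetical; not the actual PKG data model."""

    def __init__(self):
        # Maps (subject, relation) to the set of connected objects.
        self.edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self.edges[(subject, relation)].add(obj)

    def objects(self, subject, relation):
        # .get avoids creating empty entries for unknown keys.
        return self.edges.get((subject, relation), set())


def author_entity_profile(graph, author_id, article_ids):
    """Count how often each bio-entity appears in an author's articles."""
    counts = defaultdict(int)
    for article in article_ids:
        if author_id in graph.objects(article, "written_by"):
            for entity in graph.objects(article, "mentions"):
                counts[entity] += 1
    return dict(counts)


# Made-up identifiers, used only to exercise the sketch.
kg = ToyKnowledgeGraph()
kg.add("article:1", "written_by", "author:A")
kg.add("article:1", "mentions", "entity:E1")
kg.add("article:1", "funded_by", "grant:G1")
kg.add("article:2", "written_by", "author:A")
kg.add("article:2", "mentions", "entity:E1")
kg.add("article:2", "mentions", "entity:E2")
kg.add("article:3", "written_by", "author:B")
kg.add("article:3", "mentions", "entity:E2")

articles = ["article:1", "article:2", "article:3"]
profile = author_entity_profile(kg, "author:A", articles)
```

Here `profile` counts only the entities in articles attributed to `author:A`, which is why accurate author name disambiguation matters: a wrongly merged or split author identity would distort every profile built on top of it.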
