Discovering and Summarizing Relationships Between Chemicals, Genes, Proteins, and Diseases in PubChem

This paper describes the literature knowledge panels developed and implemented in PubChem. These panels help uncover and summarize important relationships between chemicals, genes, proteins, and diseases by analyzing co-occurrences of terms in biomedical literature abstracts. Named entities in PubMed records are matched with chemical names in PubChem, disease names in Medical Subject Headings (MeSH), and gene/protein names in popular gene/protein information resources, and the most closely related entities are identified using statistical analysis and relevance-based sampling. Knowledge panels for the co-occurrence of chemical, disease, and gene/protein entities are included in PubChem Compound, Protein, and Gene pages, where they summarize these relationships in a compact form. Statistical methods for removing redundancy and estimating relevance scores are discussed, along with the benefits and pitfalls of relying on automated (i.e., not human-curated) methods operating on data from multiple heterogeneous sources.
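To make the idea of co-occurrence analysis concrete, the following is a minimal sketch that counts how often pairs of entities appear in the same abstract and scores each pair with pointwise mutual information (PMI). PMI is a generic association measure chosen here for illustration only; the abstract does not state which statistic PubChem uses, and the function name `cooccurrence_scores`, the toy documents, and the entity labels are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_scores(docs):
    """Score entity pairs by PMI over a collection of abstracts.

    docs: list of sets, each holding the entity names found in one abstract.
    Returns a dict mapping a sorted (entity_a, entity_b) tuple to its PMI.
    NOTE: illustrative only; not the scoring method used by PubChem.
    """
    n_docs = len(docs)
    entity_counts = Counter()  # number of abstracts mentioning each entity
    pair_counts = Counter()    # number of abstracts mentioning both entities

    for entities in docs:
        entity_counts.update(entities)
        # Sort so that each unordered pair always maps to the same key.
        pair_counts.update(combinations(sorted(entities), 2))

    scores = {}
    for (a, b), n_ab in pair_counts.items():
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), with document-level probabilities.
        scores[(a, b)] = math.log2(
            n_ab * n_docs / (entity_counts[a] * entity_counts[b])
        )
    return scores


# Hypothetical toy corpus: entities already recognized in four abstracts.
docs = [
    {"aspirin", "inflammation"},
    {"aspirin", "inflammation", "PTGS2"},
    {"aspirin", "PTGS2"},
    {"glucose", "diabetes"},
]
scores = cooccurrence_scores(docs)

# Rank pairs from most to least strongly associated.
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 3))
```

A raw PMI ranking favors rare pairs, which is one reason a production system would likely combine it with frequency thresholds or other relevance estimates.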
