PGxMine: Text mining for curation of PharmGKB

Precision medicine tailors treatment to an individual's personal data, including differences in their genome. The Pharmacogenomics Knowledgebase (PharmGKB) provides highly curated information on the effect of genetic variation on drug response and side effects for a wide range of drugs. PharmGKB’s scientific curators triage, review and annotate a large number of papers each year, but the task is challenging. We present PGxMine, a text-mined resource of pharmacogenomic associations drawn from all accessible published literature, to assist in the curation of PharmGKB. We developed a supervised machine learning pipeline to extract associations between a variant (DNA and protein changes, star alleles and dbSNP identifiers) and a chemical. PGxMine covers 452 chemicals and 2,426 variants and contains 19,930 mentions of pharmacogenomic associations across 7,170 papers. An evaluation by PharmGKB curators found that 57 of the top 100 associations not found in PharmGKB led to 83 curatable papers, and that a further 24 associations would likely lead to curatable papers through citations. The results can be viewed at https://pgxmine.pharmgkb.org/ and the code can be downloaded at https://github.com/jakelever/pgxmine.
