tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine

Motivation: Despite significant efforts in expert curation, the clinical relevance of most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the biological function and disease impact of variants is buried in unstructured literature. Previous studies have attempted to harvest this information with text-mining techniques, but they are of limited use because their mutation extraction results are not standardized or integrated with curated data.

Results: We propose an automatic method to extract variant mentions and normalize them to unique identifiers (dbSNP RSIDs). In benchmarking, our method achieves a high F-measure of ∼90% and compares favorably with the state of the art. Next, we applied our approach to the entire PubMed and validated the results in two ways: by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes, which are presumed to be deleterious and not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating and prioritizing variants in genomic research.

Availability and implementation: The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/.

Contact: zhiyong.lu@nih.gov.
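The comparison against ClinVar and the rare-variant filter described in the abstract can be illustrated with a small sketch. This is not the tmVar 2.0 implementation: the `MinedVariant` record, the `novel_and_rare` function, and all RSIDs, gene names and MAF values below are hypothetical placeholders, and the only detail taken from the abstract is the MAF ≤ 0.01 cutoff.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class MinedVariant:
    """Hypothetical record for one text-mined, normalized variant."""
    rsid: str             # dbSNP identifier, e.g. "rs" followed by digits
    gene: str             # gene symbol paired with the variant in the text
    maf: Optional[float]  # minor allele frequency; None if unknown


def novel_and_rare(
    mined: Iterable[MinedVariant],
    clinvar_rsids: Set[str],
    maf_cutoff: float = 0.01,  # cutoff stated in the abstract
) -> Tuple[List[MinedVariant], List[MinedVariant]]:
    """Return (variants absent from ClinVar, the rare subset of those)."""
    novel = [v for v in mined if v.rsid not in clinvar_rsids]
    # Variants with unknown MAF are excluded from the rare set (an assumption).
    rare = [v for v in novel if v.maf is not None and v.maf <= maf_cutoff]
    return novel, rare


# Made-up example data, for illustration only.
mined = [
    MinedVariant("rs0000001", "GENE_A", 0.002),
    MinedVariant("rs0000002", "GENE_B", 0.150),
    MinedVariant("rs0000003", "GENE_A", None),
    MinedVariant("rs0000004", "GENE_C", 0.010),
]
clinvar_rsids = {"rs0000002"}

novel, rare = novel_and_rare(mined, clinvar_rsids)
novel_genes = {v.gene for v in novel}
```

Here `novel` holds the three variants missing from the hypothetical ClinVar set, and `rare` keeps only those with a known MAF at or below the cutoff. Counting distinct genes, as in `novel_genes`, mirrors how the abstract reports both variant and gene totals.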
