Predicting the pathogenicity of protein coding mutations using Natural Language Processing

DNA sequencing of tumor cells has revealed thousands of genetic mutations, but only some of them cause cancer. Distinguishing mutations that contribute to tumor growth from neutral ones is extremely challenging and is currently carried out manually. This manual annotation is slow and costly. In this study, we introduce a novel method, NLP-SNPPred, that reads the scientific literature and learns the implicit features that make certain variations pathogenic. Specifically, our method ingests the biomedical literature and produces vector representations of it using state-of-the-art NLP methods such as sent2vec, word2vec and tf-idf. These representations are then fed to machine learning predictors that classify variations as pathogenic or neutral. Our best model (NLP-SNPPred), trained on OncoKB and evaluated on several publicly available benchmark datasets, outperformed state-of-the-art function prediction methods. Our results show that NLP can be used effectively to predict the functional impact of protein-coding variations with minimal complementary biological features. Moreover, encoding biological knowledge into the right representations, combined with machine learning methods, can help automate manual annotation efforts. A free-to-use web server is available at http://www.nlp-snppred.cbrlab.org
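
The general idea of turning variant-related text into vectors and passing them to a predictor can be illustrated with a minimal sketch. It is not the NLP-SNPPred implementation: the four training sentences are invented for illustration, only tf-idf is shown (not sent2vec or word2vec), and a nearest-centroid rule with cosine similarity stands in for the unspecified machine learning predictors. All function names are hypothetical.

```python
import math
from collections import Counter

# Toy, invented snippets standing in for literature about a variant.
TRAIN = [
    ("variant activates kinase signaling and drives tumor growth", "pathogenic"),
    ("mutation confers constitutive kinase activation in tumor cells", "pathogenic"),
    ("variant shows no effect on protein function in assays", "neutral"),
    ("substitution is a common polymorphism with no functional effect", "neutral"),
]

def tokenize(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for every term seen in docs."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf_vector(doc, idf):
    """Sparse, L2-normalized tf-idf vector; unseen terms are dropped."""
    tf = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def centroid(vectors):
    """Component-wise mean of sparse vectors."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Training": fit idf weights, then average the vectors of each class.
IDF = fit_idf([text for text, _ in TRAIN])
CENTROIDS = {
    label: centroid([tfidf_vector(t, IDF) for t, l in TRAIN if l == label])
    for label in {l for _, l in TRAIN}
}

def predict(text):
    """Return the label whose class centroid is most similar to the text."""
    vec = tfidf_vector(text, IDF)
    return max(CENTROIDS, key=lambda label: cosine(vec, CENTROIDS[label]))

if __name__ == "__main__":
    print(predict("mutation drives constitutive signaling and tumor growth"))
    print(predict("substitution has no functional effect on protein"))
```

In a realistic setting the text would come from curated literature linked to each variant, and the centroid rule would be replaced by a trained classifier evaluated on held-out benchmark data.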
