Pathogenicity and functional impact of non-frameshifting insertion/deletion variation in the human genome

Differentiation between phenotypically neutral and disease-causing genetic variation remains an open and relevant problem. Among different types of variation, non-frameshifting insertions and deletions (indels) represent an understudied group with widespread phenotypic consequences. To address this challenge, we present a machine learning method, MutPred-Indel, that predicts pathogenicity and identifies types of functional residues impacted by non-frameshifting insertion/deletion variation. The model shows good predictive performance as well as the ability to identify impacted structural and functional residues including secondary structure, intrinsic disorder, metal and macromolecular binding, post-translational modifications, allosteric sites, and catalytic residues. We identify structural and functional mechanisms impacted preferentially by germline variation from the Human Gene Mutation Database, recurrent somatic variation from COSMIC in the context of different cancers, as well as de novo variants from families with autism spectrum disorder. Further, the distributions of pathogenicity prediction scores generated by MutPred-Indel are shown to differentiate highly recurrent from non-recurrent somatic variation. Collectively, we present a framework to facilitate the interrogation of both pathogenicity and the functional effects of non-frameshifting insertion/deletion variants. The MutPred-Indel webserver is available at http://mutpred.mutdb.org/.
