DDIG-in: detecting disease-causing genetic variations due to frameshifting indels and nonsense mutations employing sequence and structural properties at nucleotide and protein levels

MOTIVATION: Frameshifting (FS) indels and nonsense (NS) variants disrupt the protein-coding sequence downstream of the mutation site, the former by changing the reading frame and the latter by introducing a premature termination codon. Despite such drastic changes to the protein sequence, FS indels and NS variants have been found in healthy individuals. Discriminating disease-causing from neutral FS indels and NS variants remains an understudied problem.

RESULTS: We have built a machine learning method, DDIG-in (FS), trained on real human genetic variations from the Human Gene Mutation Database (inherited disease-causing) and the 1000 Genomes Project (GP) (putatively neutral). The method incorporates both sequence and predicted structural features and performs robustly in 10-fold cross-validation and in independent tests on both FS indels and NS variants. We show that human-derived NS variants and FS indels derived from animal orthologs can be used effectively for independent testing of a method trained on human-derived FS indels. DDIG-in (FS) achieves a Matthews correlation coefficient (MCC) of 0.59, a sensitivity of 86%, and a specificity of 72% for FS indels. Applied to NS variants, DDIG-in (FS) yields essentially the same performance (MCC of 0.43) as a method trained specifically on NS variants. DDIG-in (FS) was shown to improve significantly over existing techniques.
