Biological function polarity prediction of missense variants using machine learning

Functional interpretation is crucial because a human exome carries on average 20,000 missense variants, the great majority of which are not associated with any underlying disease. In silico bioinformatics tools can predict the deleteriousness of variants or assess their functional impact by assigning scores, but they cannot predict whether a given variant results in gain or loss of function at the protein level. Here, we show that machine learning can effectively predict this biological function polarity of missense variants. The new method applies a weighted gradient boosting machine approach to a set of damaging variants (1,288 loss-of-function and 218 gain-of-function variants), as annotated by the tools SIFT, PolyPhen2 and CADD. An area under the ROC curve of 0.85 illustrates the high discriminative power of the classifier. Its predictive performance remains consistent on an independent set of damaging variants, with an area under the ROC curve of 0.83. This new approach may help to guide biological experiments on the clinical relevance of damaging genetic variants.

Author summary

A missense variant occurs when a single genetic alteration in DNA causes a different amino acid to be incorporated into the protein. This amino acid change can inactivate the existing protein function, causing loss of function, or produce a new function, causing gain of function. It is therefore very important to interpret the functional consequences of missense variants, as they often turn out to be disease-causing. Each individual's genome sequence contains thousands of missense variants, of which very few are actually associated with any underlying disease. Various computational tools have been developed to predict whether missense variants are damaging, but none of them can predict whether a damaging missense variant causes gain of function or loss of function.

We have developed a new ensemble classifier to predict this biological function polarity at the protein level. The classifier combines the prediction scores of three existing bioinformatics tools and applies machine learning to make effective predictions. We have validated our classifier against an independent data set to show its high predictive power and robustness. The predictions made by our machine learning tool can be used as indicators of biological function polarity, but only alongside further evidence on pathogenicity.
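The abstract does not say how the weights were chosen or which software was used, so the following is only a minimal sketch of two ingredients it mentions: class weighting for the imbalanced training set and the area under the ROC curve. It assumes inverse-frequency ("balanced") weights, which may differ from the authors' scheme. The function names, the `LOF`/`GOF` labels and the toy scores are hypothetical; only the class counts (1,288 and 218) come from the text.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * n_class).

    Assumption: this 'balanced' heuristic is one common choice; the
    abstract does not state the weighting actually used.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def roc_auc(y_true, scores, positive="GOF"):
    """AUC as the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Class counts taken from the abstract; labels are illustrative.
labels = ["LOF"] * 1288 + ["GOF"] * 218
weights = balanced_class_weights(labels)

# Toy scores (made up) to show the AUC computation on four variants.
toy_labels = ["GOF", "GOF", "LOF", "LOF"]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)

print(weights)
print(toy_auc)
```

With these weights the two classes contribute equally to the total training weight, so each gain-of-function variant counts roughly six times as much as a loss-of-function variant. Per-sample weights of this kind are what a weighted gradient boosting implementation would typically accept during fitting.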
