Development of pathogenicity predictors specific for variants that do not comply with clinical guidelines for the use of computational evidence

Background

Strict guidelines delimit the use of computational information in the clinical setting, due to the still moderate accuracy of in silico tools. These guidelines indicate that several tools should always be used and that full agreement between them is required before their results can be considered supporting evidence in medical decision processes. Applying this simple rule certainly decreases the error rate of in silico pathogenicity assignments. However, when predictors disagree, the rule leads to the rejection of potentially valuable information for a number of variants. In this work, we focus on these protein sequence variants and develop specific predictors to help improve the success rate of their annotation.

Results

We have used a set of 59,442 protein sequence variants (15,723 pathological and 43,719 neutral) from 228 proteins to identify those cases for which pathogenicity predictors disagree. We have repeated this process for all the possible combinations of five known methods (SIFT, PolyPhen-2, PON-P2, CADD and MutationTaster2). For each resulting subset we have trained a specific pathogenicity predictor. We find that these specific predictors are able to discriminate between neutral and pathogenic variants, with a success rate different from random. They tend to outperform the constitutive methods, but this trend decreases as the performance of the constitutive predictor improves (e.g. with PON-P2 and PolyPhen-2). We also find that specific methods outperform standard consensus methods (Condel and CAROL).

Conclusion

By focusing development efforts on variants for which known methods disagree, we may obtain pathogenicity predictors with improved performance. Although we have not yet reached the success rate that would allow the use of this computational evidence in a clinical setting, the simplicity of the approach indicates that more advanced methods may reach this goal in the near future.
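The subset-building step described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that every tool returns a binary call ("pathogenic" or "neutral") for every variant and that "all the possible combinations" means every group of two or more tools; the abstract specifies neither point. The names `tool_combinations`, `disagreement_subset`, `make_variant` and the toy variants are hypothetical.

```python
from itertools import combinations

# The five methods named in the abstract.
TOOLS = ["SIFT", "PolyPhen-2", "PON-P2", "CADD", "MutationTaster2"]


def tool_combinations(tools, min_size=2):
    """Yield every combination of at least `min_size` tools.

    Assumption: combinations of two or more tools are meant; the
    abstract does not state the minimum size.
    """
    for size in range(min_size, len(tools) + 1):
        yield from combinations(tools, size)


def disagreement_subset(variants, combo):
    """Return the variants on which the tools in `combo` do not all agree."""
    return [
        v for v in variants
        if len({v["calls"][tool] for tool in combo}) > 1
    ]


def make_variant(variant_id, default_call, **overrides):
    """Build a toy variant record (hypothetical data, for illustration only)."""
    calls = {tool: default_call for tool in TOOLS}
    calls.update(overrides)
    return {"id": variant_id, "calls": calls}


# Toy data: v1 and v3 have unanimous calls; v2 has one dissenting tool.
variants = [
    make_variant("v1", "pathogenic"),
    make_variant("v2", "pathogenic", **{"PolyPhen-2": "neutral"}),
    make_variant("v3", "neutral"),
]

# One disagreement subset per tool combination; in the study, a specific
# predictor is then trained on each such subset.
subsets = {
    combo: disagreement_subset(variants, combo)
    for combo in tool_combinations(TOOLS)
}
```

Under these assumptions five tools give 26 combinations, and a variant enters a subset only when the tools in that particular combination give conflicting calls, so the toy variant `v2` appears in every subset whose combination includes PolyPhen-2.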
