The Evaluation of Tools Used to Predict the Impact of Missense Variants Is Hindered by Two Types of Circularity

Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies for exploring complex and Mendelian diseases. A large number of in silico tools have been employed for the task of pathogenicity prediction, including PolyPhen‐2, SIFT, FatHMM, MutationTaster‐2, MutationAssessor, Combined Annotation Dependent Depletion, LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Due to the wealth of these methods, an important practical question to answer is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants. We here demonstrate in a study of 10 tools on five datasets that such a comparative evaluation of these tools is hindered by two types of circularity: they arise due to (1) the same variants or (2) different variants from the same protein occurring both in the datasets used for training and for evaluation of these tools, which may lead to overly optimistic results. We show that comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that circularity confounded tools are most accurate among all tools, and may even outperform optimized combinations of tools.

[1]  Tom R. Gaunt,et al.  Predicting the Functional, Molecular, and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models , 2012, Human mutation.

[2]  P. Bork,et al.  A method and server for predicting damaging missense mutations , 2010, Nature Methods.

[3]  E. Boerwinkle,et al.  dbNSFP v2.0: A Database of Human Non‐synonymous SNVs and Their Functional Predictions and Annotations , 2013, Human mutation.

[4]  C. Chothia,et al.  Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. , 2001, Journal of molecular biology.

[5]  Elizabeth M. Smigielski,et al.  dbSNP: the NCBI database of genetic variation , 2001, Nucleic Acids Res..

[6]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[7]  R. Durbin,et al.  Pfam: A comprehensive database of protein domain families based on seed alignments , 1997, Proteins.

[8]  Steven Henikoff,et al.  SIFT: predicting amino acid changes that affect protein function , 2003, Nucleic Acids Res..

[9]  C. Sander,et al.  Predicting the functional impact of protein mutations: application to cancer genomics , 2011, Nucleic acids research.

[10]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[11]  Jacob A. Tennessen,et al.  Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes , 2012, Science.

[12]  Ning Ma,et al.  BLAST+: architecture and applications , 2009, BMC Bioinformatics.

[13]  Johnny S. H. Kwan,et al.  Predicting Mendelian Disease-Causing Non-Synonymous Single Nucleotide Variants in Exome Sequencing Studies , 2013, PLoS genetics.

[14]  Adam Kiezun,et al.  Exome sequencing and the genetic basis of complex traits , 2012, Nature Genetics.

[15]  Serafim Batzoglou,et al.  Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++ , 2010, PLoS Comput. Biol..

[16]  Anaïs Mottaz,et al.  Bioinformatics Applications Note Databases and Ontologies Easy Retrieval of Single Amino-acid Polymorphisms and Phenotype Information Using Swissvar , 2022 .

[17]  H. Hakonarson,et al.  ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data , 2010, Nucleic acids research.

[18]  M. Vihinen How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis , 2012, BMC Genomics.

[19]  Giuliana Silvestri,et al.  Next-generation sequencing-based molecular diagnosis of 82 retinitis pigmentosa probands from Northern Ireland , 2014, Human Genetics.

[20]  Mark Goadrich,et al.  The relationship between Precision-Recall and ROC curves , 2006, ICML.

[21]  Andrew M. Jenkinson,et al.  Ensembl 2009 , 2008, Nucleic Acids Res..

[22]  Trevor Hastie,et al.  Model Assessment and Selection , 2009 .

[23]  Robert Gentleman,et al.  Comprehensive genomic analysis identifies SOX2 as a frequently amplified gene in small-cell lung cancer , 2012, Nature Genetics.

[24]  Michael Peyton,et al.  Systematic Identification of Molecular Subtype-Selective Vulnerabilities in Non-Small-Cell Lung Cancer , 2013, Cell.

[25]  K. Fliessbach,et al.  Human Mutation , 2017 .

[26]  J. Shendure,et al.  Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data , 2011, Nature Reviews Genetics.

[27]  A. Gonzalez-Perez,et al.  Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. , 2011, American journal of human genetics.

[28]  M. Vihinen,et al.  Performance of mutation pathogenicity prediction methods on missense variants , 2011, Human mutation.

[29]  Simion I. Chiosea,et al.  Hotspot activating PRKD1 somatic mutations in polymorphous low-grade adenocarcinomas of the salivary glands , 2014, Nature Genetics.

[30]  Justin C. Fay,et al.  Identification of deleterious mutations within three human genomes. , 2009, Genome research.

[31]  A. Toutain,et al.  Mutations in SLC13A5 cause autosomal-recessive epileptic encephalopathy with seizure onset in the first days of life. , 2014, American journal of human genetics.

[32]  Mauno Vihinen,et al.  VariBench: A Benchmark Database for Variations , 2013, Human mutation.

[33]  Jaroslav Bendl,et al.  PredictSNP: Robust and Accurate Consensus Classifier for Prediction of Disease-Related Mutations , 2014, PLoS Comput. Biol..

[34]  D. Easton,et al.  Germline BRCA1 mutations increase prostate cancer risk , 2012, British Journal of Cancer.

[35]  Joseph K. Pickrell,et al.  A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes , 2012, Science.

[36]  Mauno Vihinen,et al.  Guidelines for Reporting and Using Prediction Tools for Genetic Variation Analysis , 2013, Human mutation.

[37]  F. Collins,et al.  Potential etiologic and functional implications of genome-wide association loci for human diseases and traits , 2009, Proceedings of the National Academy of Sciences.

[38]  Daniel R. Zerbino,et al.  Ensembl 2014 , 2013, Nucleic Acids Res..

[39]  Michele Magrane,et al.  UniProt Knowledgebase: a hub of integrated protein data , 2011, Database J. Biol. Databases Curation.

[40]  R. Altman,et al.  Collective judgment predicts disease-associated single nucleotide variants , 2013, BMC Genomics.

[41]  J. Shendure,et al.  A general framework for estimating the relative pathogenicity of human genetic variants , 2014, Nature Genetics.

[42]  Eric S. Lander,et al.  A polygenic burden of rare disruptive mutations in schizophrenia , 2014, Nature.

[43]  Honglak Lee,et al.  Efficient L1 Regularized Logistic Regression , 2006, AAAI.

[44]  Daniel Rios,et al.  Bioinformatics Applications Note Databases and Ontologies Deriving the Consequences of Genomic Variants with the Ensembl Api and Snp Effect Predictor , 2022 .

[45]  Jana Marie Schwarz,et al.  MutationTaster2: mutation prediction for the deep-sequencing age , 2014, Nature Methods.

[46]  P. Stenson,et al.  The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine , 2013, Human Genetics.