Assessing performance of pathogenicity predictors using clinically relevant variant datasets

BACKGROUND Pathogenicity predictors are integral to genomic variant interpretation but, despite their widespread use, their performance has not been independently validated on a clinically relevant dataset.

METHODS We derive two validation datasets: an 'open' dataset containing variants extracted from publicly available databases, similar to those commonly applied in previous benchmarking exercises, and a 'clinically representative' dataset containing variants identified through research/diagnostic exome and panel sequencing. Using these datasets, we evaluate the performance of three recent meta-predictors, REVEL, GAVIN and ClinPred, and compare their performance against two commonly used in silico tools, SIFT and PolyPhen-2.

RESULTS Although the newer meta-predictors outperform the older tools, the performance of all pathogenicity predictors is substantially lower in the clinically representative dataset. In the clinically representative dataset, REVEL performed best, with an area under the receiver operating characteristic curve of 0.82. A concordance-based approach that relies on a consensus of multiple tools reduces performance, owing both to discordance between tools and to false concordance, where tools share the same misclassification. Analysis of the features each tool uses may give insight into tool performance and misclassification.

CONCLUSION Our results support the adoption of meta-predictors over traditional in silico tools, but do not support a consensus-based approach as in current practice.
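The two quantities the abstract relies on, the area under the ROC curve for a single tool and the outcome of a consensus call across tools, can be illustrated with a minimal sketch. The function names (`roc_auc`, `consensus_calls`) and all variant labels, scores and calls below are invented for illustration; they are not the study's data or code, and the study's own tooling and thresholds are not stated here.

```python
from typing import List, Optional, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via the pairwise (Mann-Whitney) identity: the fraction of
    pathogenic/benign pairs in which the pathogenic variant scores higher.
    Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pathogenic and one benign variant")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def consensus_calls(tool_calls: Sequence[Sequence[int]]) -> List[Optional[int]]:
    """Per variant: the shared call if every tool agrees, else None (discordant)."""
    return [calls[0] if len(set(calls)) == 1 else None for calls in zip(*tool_calls)]


# Toy data (assumed): 1 = pathogenic, 0 = benign.
truth = [1, 1, 1, 0, 0, 0]
tool_a_scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
tool_a_calls = [1, 1, 1, 1, 0, 0]
tool_b_calls = [1, 0, 1, 1, 0, 0]

auc_a = roc_auc(truth, tool_a_scores)
consensus = consensus_calls([tool_a_calls, tool_b_calls])

# Discordance: the tools disagree, so no consensus call is available.
discordant = sum(c is None for c in consensus)
# False concordance: the tools agree with each other but not with the truth.
false_concordant = sum(c is not None and c != t for c, t in zip(consensus, truth))

print(f"tool A AUC: {auc_a:.3f}")
print(f"discordant: {discordant}, false concordant: {false_concordant}")
```

In this toy set the consensus loses one variant to discordance and misclassifies another despite agreement, which mirrors the two failure modes the abstract attributes to the consensus approach.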
