Variation benchmark datasets: update, criteria, quality and applications

Development of new computational methods and testing of their performance have to be done on experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcomes are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time-consuming to generate. VariBench and VariSNP are the two existing databases for sharing variation benchmark datasets. They have been used for training and benchmarking predictors for various types of variations and their effects. There are 419 new datasets from 109 papers containing altogether 329,003,373 variants; however, there is plenty of redundancy between the datasets. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. The contents of the datasets vary depending on the information in the original source. The available datasets have been categorized into 20 groups and subgroups. There are datasets for insertions and deletions, substitutions in coding and non-coding regions, and structure-mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein property predictions for aggregation, binding free energy, disorder and stability. In addition, there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level, including relevant cross-references and variant descriptions. The updated VariBench facilitates development and testing of new methods and comparison of obtained performance to previously published methods. We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies and showed that such comparisons are feasible and useful; however, they may be limited when studies do not provide sufficient details or share their data.
AUTHOR SUMMARY The performance of a prediction method can only be assessed in comparison to existing knowledge. For that purpose, benchmark datasets with known and verified outcomes are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time-consuming to generate. We collected variation datasets from the literature, websites and databases. There are 419 separate new datasets, which, however, contain plenty of redundancy. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. There are datasets for insertions and deletions, substitutions in coding and non-coding regions, and structure-mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein property predictions for aggregation, binding free energy, disorder and stability. In addition, there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. The updated VariBench facilitates development and testing of new methods and comparison of obtained performance to previously published methods. We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies and showed that such comparisons are possible and useful when the details of the studies and the datasets are shared.
