VariBench: A Benchmark Database for Variations

Several computational methods have been developed for predicting the effects of variations, and the volume of variation data is expanding rapidly. Comparing the performance of these tools has been very difficult because the methods have been trained and tested with different datasets. Until now, unbiased and representative benchmark datasets have been missing. We have developed a benchmark database suite, VariBench, to overcome this problem. VariBench contains datasets of experimentally verified, high‐quality variation data carefully chosen from the literature and relevant databases. It maps variation positions to different levels (protein, RNA and DNA sequences, and protein three‐dimensional structure) and maps identifiers to relevant databases. VariBench contains the first benchmark datasets for variation effect analysis, a field of high importance that is developing rapidly. VariBench datasets can be used, for example, to test the performance of prediction tools and to train novel machine learning‐based tools. New datasets will be added, and the community is encouraged to submit high‐quality datasets to the service. VariBench is freely available at http://structure.bmc.lu.se/VariBench.
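
As an illustration of how a benchmark dataset might be used to test a prediction tool, the following minimal sketch compares a tool's predictions with benchmark labels and computes common performance measures. The label values ("pathogenic", "neutral"), the function names, and the toy data are assumptions made for this example; they do not reflect the VariBench file format or the output of any particular tool.

```python
from math import sqrt


def confusion_counts(true_labels, predicted_labels, positive="pathogenic"):
    """Count true/false positives and negatives for a binary classification.

    The label strings are hypothetical; adapt them to the dataset in use.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def performance(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    total = tp + fp + tn + fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        # MCC is conventionally set to 0 when any marginal sum is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy data standing in for benchmark labels and a tool's predictions.
benchmark = ["pathogenic", "pathogenic", "neutral", "neutral", "pathogenic", "neutral"]
predicted = ["pathogenic", "neutral", "neutral", "pathogenic", "pathogenic", "neutral"]

counts = confusion_counts(benchmark, predicted)
scores = performance(*counts)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Reporting several measures together, rather than accuracy alone, gives a fuller picture when the benchmark classes are unevenly sized.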
