The VAAST Variant Prioritizer (VVP): ultrafast, easy to use whole genome variant prioritization tool

BackgroundPrioritization of sequence variants for diagnosis and discovery of Mendelian diseases is challenging, especially in large collections of whole genome sequences (WGS). Fast, scalable solutions are needed for discovery research, for clinical applications, and for curation of massive public variant repositories such as dbSNP and gnomAD. In response, we have developed VVP, the VAAST Variant Prioritizer. VVP is ultrafast, scales to even the largest variant repositories and genome collections, and its outputs are designed to simplify clinical interpretation of variants of uncertain significance.ResultsWe show that scoring the entire contents of dbSNP (> 155 million variants) requires only 95 min using a machine with 4 cpus and 16 GB of RAM, and that a 60X WGS can be processed in less than 5 min. We also demonstrate that VVP can score variants anywhere in the genome, regardless of type, effect, or location. It does so by integrating sequence conservation, the type of sequence change, allele frequencies, variant burden, and zygosity. Finally, we also show that VVP scores are consistently accurate, and easily interpreted, traits not shared by many commonly used tools such as SIFT and CADD.ConclusionsVVP provides rapid and scalable means to prioritize any sequence variant, anywhere in the genome, and its scores are designed to facilitate variant interpretation using ACMG and NHS guidelines. These traits make it well suited for operation on very large collections of WGS sequences.

[1]  Daniel Rios,et al.  Bioinformatics Applications Note Databases and Ontologies Deriving the Consequences of Genomic Variants with the Ensembl Api and Snp Effect Predictor , 2022 .

[2]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[3]  J. Shendure,et al.  A general framework for estimating the relative pathogenicity of human genetic variants , 2014, Nature Genetics.

[4]  Mark Yandell,et al.  Using VAAST to Identify Disease‐Associated Variants in Next‐Generation Sequencing Data , 2014, Current protocols in human genetics.

[5]  David H. Dreyfus,et al.  CYSTIC FIBROSIS HETEROZYGOTE RESISTANCE TO CHOLERA TOXIN IN THE CYSTIC FIBROSIS MOUSE MODEL , 1995, Pediatrics.

[6]  R. Durbin,et al.  The Sequence Ontology: a tool for the unification of genome annotations , 2005, Genome Biology.

[7]  Gustavo Glusman,et al.  A unified test of linkage analysis and rare-variant association for analysis of pedigree sequence data , 2014, Nature Biotechnology.

[8]  W. Youden,et al.  Index for rating diagnostic tests , 1950, Cancer.

[9]  Kenny Q. Ye,et al.  An integrated map of genetic variation from 1,092 human genomes , 2012, Nature.

[10]  Vladimir B. Bajic,et al.  Semantic prioritization of novel causative genomic variants , 2017, PLoS Comput. Biol..

[11]  S. Henikoff,et al.  Predicting the effects of amino acid substitutions on protein function. , 2006, Annual review of genomics and human genetics.

[12]  Bale,et al.  Standards and Guidelines for the Interpretation of Sequence Variants: A Joint Consensus Recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology , 2015, Genetics in Medicine.

[13]  J. Miller,et al.  Predicting the Functional Effect of Amino Acid Substitutions and Indels , 2012, PloS one.

[14]  Gabor T. Marth,et al.  A global reference for human genetic variation , 2015, Nature.

[15]  Mark Yandell,et al.  VAAST 2.0: Improved Variant Classification and Disease-Gene Identification Using a Conservation-Controlled Amino Acid Substitution Matrix , 2013, Genetic epidemiology.

[16]  M. G. Reese,et al.  A probabilistic disease-gene finder for personal genomes. , 2011, Genome research.

[17]  Ricardo Villamarín-Salomón,et al.  ClinVar: public archive of interpretations of clinically relevant variants , 2015, Nucleic Acids Res..

[18]  K. Eilbeck,et al.  Settling the score: variant prioritization and Mendelian disease , 2017, Nature Reviews Genetics.

[19]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[20]  Sheena M. Scroggins,et al.  CADD score has limited clinical validity for the identification of pathogenic variants in non-coding regions in a hereditary cancer panel , 2016, Genetics in Medicine.

[21]  D. G. MacArthur,et al.  Guidelines for investigating causality of sequence variants in human disease , 2014, Nature.

[22]  G. Abecasis,et al.  Rare-variant association analysis: study designs and statistical tests. , 2014, American journal of human genetics.

[23]  Gonçalo R. Abecasis,et al.  The variant call format and VCFtools , 2011, Bioinform..

[24]  Giorgio Valentini,et al.  A Whole-Genome Analysis Framework for Effective Identification of Pathogenic Regulatory Variants in Mendelian Disease. , 2016, American journal of human genetics.

[25]  Karen Eilbeck,et al.  A standard variation file format for human genome sequences , 2010, Genome Biology.

[26]  Z. Yang,et al.  A space-time process model for the evolution of DNA sequences. , 1995, Genetics.

[27]  Brett J. Kennedy,et al.  Phevor combines multiple biomedical ontologies for accurate identification of disease-causing alleles in single individuals and small nuclear families. , 2014, American journal of human genetics.

[28]  J. Shendure,et al.  Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data , 2011, Nature Reviews Genetics.

[29]  S. Henikoff,et al.  Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm , 2009, Nature Protocols.