ATAV: a comprehensive platform for population-scale genomic analyses

Background: A common approach in sequencing studies is to joint-call all samples and store their variants in a single file. When new samples are continually added or controls are reused across several studies, the cost and time required to repeat joint-calling for each analysis can become prohibitive.

Results: We present ATAV, an analysis platform for large-scale whole-exome and whole-genome sequencing projects. ATAV stores variant and per-site coverage data for all samples in a centralized database, which it queries efficiently to support diagnostic analyses for trios and singletons, as well as rare-variant collapsing analyses for finding disease associations in complex diseases. Runtime logs ensure full reproducibility, and the modular ATAV framework can be extended through continuous development. Besides helping to identify disease-causing variants for a range of diseases, ATAV has enabled the discovery of disease genes by rare-variant collapsing on datasets containing more than 20,000 samples. Analyses to date have been performed on data from more than 110,000 individuals, demonstrating the scalability of the framework. To give users easy access to variant-level data directly from the database, we provide a web-based interface, the ATAV data browser (http://atavdb.org/). Through this browser, the general public can query summary-level data for more than 40,000 samples, representing a mix of cases and controls of diverse ancestries. Users have access to the phenotype categories of variant carriers, as well as predicted ancestry, gender, and quality metrics. In contrast to many other platforms, the data browser shows data from newly added samples in real time and therefore evolves rapidly as more samples are sequenced.

Conclusions: Through ATAV, users have public access to one of the largest variant databases for patients sequenced at a tertiary care center and can look up any genes or variants of interest. Additionally, since the entire code is freely available on GitHub, ATAV can easily be deployed by other groups that wish to build their own platform, database, and user interface.
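For readers unfamiliar with rare-variant collapsing, the idea can be sketched in a few lines. This is a minimal illustration and not ATAV's code: each sample is reduced to carrier or non-carrier of at least one qualifying variant per gene, and carrier counts in cases and controls are compared gene by gene. The use of a two-sided Fisher's exact test, the function names, and the gene and sample identifiers are all assumptions made for this example; the abstract does not state which test ATAV applies.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability that x of the col1 carriers are cases.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def collapse(carriers_by_gene, cases, controls):
    """Return {gene: (case_carriers, control_carriers, p_value)}.

    carriers_by_gene maps a gene to the set of sample IDs carrying at least
    one qualifying variant in it (hypothetical input format).
    """
    results = {}
    for gene, carriers in carriers_by_gene.items():
        a = len(carriers & cases)      # cases with a qualifying variant
        c = len(carriers & controls)   # controls with a qualifying variant
        p = fisher_exact_two_sided(a, len(cases) - a, c, len(controls) - c)
        results[gene] = (a, c, p)
    return results


# Toy data: all identifiers below are made up for illustration.
cases = {"s1", "s2", "s3", "s4"}
controls = {"s5", "s6", "s7", "s8"}
carriers_by_gene = {
    "GENE_A": {"s1", "s2", "s3", "s5"},
    "GENE_B": {"s1", "s5"},
}
results = collapse(carriers_by_gene, cases, controls)
```

In a real study the qualifying-variant definition (frequency thresholds, predicted effect, quality filters) does most of the work, and p-values must be corrected for the number of genes tested.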

[21]  Brittany N. Lasseigne,et al.  Exome sequencing in amyotrophic lateral sclerosis identifies risk genes and pathways , 2015, Science.

[22]  K. M. McSweeney,et al.  Exome sequencing results in successful riboflavin treatment of a rapidly progressive neurological condition , 2015, Cold Spring Harbor molecular case studies.

[23]  Heidi L Rehm,et al.  ClinGen--the Clinical Genome Resource. , 2015, The New England journal of medicine.

[24]  Brian E. Cade,et al.  Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program , 2019, Nature.

[25]  Brent S. Pedersen,et al.  A map of constrained coding regions in the human genome , 2017, Nature Genetics.

[26]  Ayal B. Gussow,et al.  The intolerance to functional genetic variation of protein domains predicts the localization of pathogenic mutations within genes , 2016, Genome Biology.

[27]  Michael R. Johnson,et al.  Ultra-rare genetic variation in common epilepsies: a case-control sequencing study , 2017, The Lancet Neurology.

[28]  D. Goldstein,et al.  Whole Exome Sequencing in 20,197 Persons for Rare Variants in Alzheimer Disease , 2018, bioRxiv.

[29]  Trevor Hastie,et al.  REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. , 2016, American journal of human genetics.

[30]  I. Scheffer,et al.  Exome‐based analysis of cardiac arrhythmia, respiratory control, and epilepsy genes in sudden unexpected death in epilepsy , 2016, Annals of neurology.

[31]  Deanna M. Church,et al.  ClinVar: public archive of relationships among sequence variation and human phenotype , 2013, Nucleic Acids Res..

[32]  David B. Goldstein,et al.  De novo mutations in ATP1A3 cause alternating hemiplegia of childhood , 2012, Nature Genetics.

[33]  Yujun Han,et al.  Whole-exome sequencing in undiagnosed genetic diseases: interpreting 119 trios , 2015, Genetics in Medicine.

[34]  Sri V. V. Deevi,et al.  Assessing the Role of Rare Genetic Variation in Patients With Heart Failure. , 2020, JAMA cardiology.

[35]  D. Goldstein,et al.  Whole‐exome sequencing in 20,197 persons for rare variants in Alzheimer's disease , 2018, Annals of clinical and translational neurology.

[36]  D. Goldstein,et al.  Exome-Based Rare-Variant Analyses in CKD. , 2019, Journal of the American Society of Nephrology : JASN.

[37]  Manuel A. R. Ferreira,et al.  PLINK: a tool set for whole-genome association and population-based linkage analyses. , 2007, American journal of human genetics.

[38]  D. Goldstein,et al.  Improved Pathogenic Variant Localization using a Hierarchical Model of Sub-regional Intolerance , 2018, bioRxiv.

[39]  Sigurjón Axel Guðjónsson,et al.  GORpipe: a query tool for working with sequence data based on a Genomic Ordered Relational (GOR) architecture , 2016, Bioinform..

[40]  Christie M. Buchovecky,et al.  Causal Genetic Variants in Stillbirth. , 2020, The New England journal of medicine.

[41]  Brittany N. Lasseigne,et al.  A new approach for rare variation collapsing on functional protein domains implicates specific genic regions in ALS , 2019, Genome research.

[42]  I. Scheffer,et al.  A case-control collapsing analysis identifies epilepsy genes implicated in trio sequencing studies focused on de novo mutations , 2017, PLoS genetics.

[43]  Gad Abraham,et al.  FlashPCA2: principal component analysis of biobank-scale genotype datasets , 2016, bioRxiv.

[44]  David J Balding,et al.  Optimizing genomic medicine in epilepsy through a gene-customized approach to missense variant interpretation , 2017, Genome research.

[45]  Gonçalo Abecasis,et al.  Whole exome sequencing and characterization of coding variation in 49,960 individuals in the UK Biobank , 2019, bioRxiv.