Very low-depth whole-genome sequencing in complex trait association studies

Abstract

Motivation: Very low-depth sequencing has been proposed as a cost-effective approach to capture low-frequency and rare variation in complex trait association studies. However, a full characterization of the genotype quality and association power for very low-depth sequencing designs is still lacking.

Results: We perform cohort-wide whole-genome sequencing (WGS) at low depth in 1239 individuals (990 at 1× depth and 249 at 4× depth) from an isolated population, and establish a robust pipeline for calling and imputing very low-depth WGS genotypes using standard bioinformatics tools. Using genotyping chip, whole-exome sequencing (75× depth) and high-depth (22×) WGS data in the same samples, we examine in detail the sensitivity of this approach, and show that imputed 1× WGS recapitulates 95.2% of variants found by imputed GWAS, with an average minor allele concordance of 97% for common and low-frequency variants. In our study, 1× WGS further allowed the discovery of 140 844 true low-frequency variants with 73% genotype concordance when compared to high-depth WGS data. Finally, using association results for 57 quantitative traits, we show that very low-depth WGS is an efficient alternative to imputed GWAS chip designs, allowing the discovery of up to twice as many true association signals as the classical imputed GWAS design.

Availability and implementation: The HELIC genotype and WGS datasets have been deposited in the European Genome-phenome Archive (https://www.ebi.ac.uk/ega/home): EGAD00010000518; EGAD00010000522; EGAD00010000610; EGAD00001001636; EGAD00001001637. The peakplotter software is available at https://github.com/wtsi-team144/peakplotter, and the transformPhenotype app can be downloaded at https://github.com/wtsi-team144/transformPhenotype.

Supplementary information: Supplementary data are available at Bioinformatics online.
