Using off-target data from whole-exome sequencing to improve genotyping accuracy, association analysis and polygenic risk prediction

Whole-exome sequencing (WES) has been widely used to study the role of protein-coding variants in genetic diseases. Non-coding regions, typically covered by sparse off-target data, are often discarded by conventional WES analyses. Here, we develop a genotype calling pipeline named WEScall to analyse both target and off-target data. We leverage linkage disequilibrium shared within study samples and from an external reference panel to improve genotyping accuracy. In an application to WES of 2527 Chinese and Malays, WEScall reduces the genotype discordance rate from 0.26% (SE = 6.4 × 10^-6) to 0.08% (SE = 3.6 × 10^-6) across 1.1 million single nucleotide polymorphisms (SNPs) in the deeply sequenced target regions. Furthermore, we obtain genotypes at a discordance rate of 0.70% (SE = 3.0 × 10^-6) across 5.2 million off-target SNPs, which had a mean sequencing depth of ~1.2×. Using this dataset, we perform genome-wide association studies of 10 metabolic traits. Despite our small sample size, we identify 10 loci at genome-wide significance (P < 5 × 10^-8), including eight well-established loci. The two novel loci, both associated with glycated haemoglobin levels, are GPATCH8-SLC4A1 (rs369762319, P = 2.56 × 10^-12) and ROR2 (rs1201042, P = 3.24 × 10^-8). Finally, using summary statistics from UK Biobank and Biobank Japan, we show that polygenic risk prediction can be significantly improved for six out of nine traits by incorporating off-target data (P < 0.01). These results demonstrate that WEScall is a useful tool to facilitate WES studies with decent amounts of off-target data.
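The genotype discordance rates quoted above can be illustrated with a minimal sketch. It assumes that discordance is the fraction of non-missing genotype calls that differ from a reference ("truth") genotype, and that the standard error is the simple binomial one; the abstract does not state how WEScall computes either quantity, so the function name `discordance_rate`, the 0/1/2 genotype coding and the toy data are all hypothetical.

```python
import math


def discordance_rate(called, truth):
    """Return (rate, standard_error) for two genotype lists.

    Genotypes are coded 0/1/2 (alternate-allele count); None marks a
    missing call. Only positions where both genotypes are present are
    compared. The binomial standard error is an assumption made for
    illustration, not necessarily the estimator used in the paper.
    """
    pairs = [(c, t) for c, t in zip(called, truth)
             if c is not None and t is not None]
    if not pairs:
        raise ValueError("no comparable genotypes")
    n = len(pairs)
    mismatches = sum(1 for c, t in pairs if c != t)
    rate = mismatches / n
    se = math.sqrt(rate * (1.0 - rate) / n)
    return rate, se


# Toy data: six comparable genotype pairs, one of which disagrees.
called = [0, 1, 2, 1, 0, None, 2, 1]
truth = [0, 1, 1, 1, 0, 2, 2, None]
rate, se = discordance_rate(called, truth)
print(f"discordance = {rate:.4f}, SE = {se:.4f}")
```

In practice the comparison would run over all sample-by-SNP genotype pairs, which is why the standard errors reported in the abstract are so small relative to the rates.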
