Systematic dissection of biases in whole-exome and whole-genome sequencing reveals major determinants of coding sequence coverage

Next-generation DNA sequencing technologies are rapidly transforming the world of human genomics. The advantages and diagnostic effectiveness of the two most widely used resequencing approaches, whole-exome (WES) and whole-genome (WGS) sequencing, are still frequently debated. In our study we developed a set of statistical tools to systematically assess the coverage of coding sequence (CDS) regions provided by several modern WES platforms, as well as by PCR-free WGS. Using several novel metrics to characterize exon coverage in WES and WGS, we showed that some WES platforms achieve substantially less biased CDS coverage than others, with lower within- and between-interval variation and virtually no GC-content bias. We discovered that, contrary to a common view, most of the coverage bias in WES stems from the mappability limitations of short reads, as well as from exome probe design. We identified a ~500 kb region of the human exome that could not be effectively characterized using short-read technology. We also showed that the overall power for SNP and indel discovery in CDS regions is virtually indistinguishable between WGS and the best WES platforms. Our results indicate that deep WES (100x) using the least biased technologies provides effective coverage (97% of 10x q10+ bases) and CDS variant discovery similar to those of standard 30x WGS, suggesting that WES remains an efficient alternative to WGS in many applications. Our work could serve as a guide for selecting an up-to-date resequencing approach in human genomic studies.
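
To make the coverage figures above more concrete, here is a minimal sketch of how a "fraction of bases at 10x or more" value and a within-interval variation value might be computed from per-base depths. It is not the authors' implementation: the function names, the example intervals, and the choice of the coefficient of variation as the variation measure are all assumptions made for illustration, and the depths are assumed to have been filtered upstream to reads with mapping quality of at least 10.

```python
from statistics import mean, pstdev


def fraction_covered(depths, min_depth=10):
    """Share of positions whose depth is at least min_depth."""
    if not depths:
        return 0.0
    return sum(d >= min_depth for d in depths) / len(depths)


def within_interval_cv(depths):
    """Coefficient of variation of depth inside one interval (assumed metric)."""
    if not depths:
        return float("nan")
    m = mean(depths)
    return pstdev(depths) / m if m > 0 else float("nan")


# Hypothetical per-base depths for two CDS intervals; the values are invented
# and are assumed to count only reads with mapping quality >= 10.
intervals = {
    "exon_A": [32, 35, 31, 30, 34],
    "exon_B": [12, 9, 4, 0, 15],
}

# Pool all positions to get an overall share of bases covered at >= 10x.
all_depths = [d for depths in intervals.values() for d in depths]
overall = fraction_covered(all_depths)

# Per-interval variation: evenly covered intervals give values near zero.
cv_by_interval = {name: within_interval_cv(d) for name, d in intervals.items()}

print(f"fraction of bases at >=10x: {overall:.2f}")
for name, cv in cv_by_interval.items():
    print(f"{name}: within-interval CV = {cv:.2f}")
```

In this toy input the evenly covered interval yields a small coefficient of variation, while the unevenly covered one yields a much larger value, which is the kind of contrast a within-interval variation metric is meant to expose.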
