Variant detection sensitivity and biases in whole genome and exome sequencing

Background: Less than two percent of the human genome is protein coding, yet that small fraction harbours the majority of known disease-causing mutations. Despite rapidly falling whole genome sequencing (WGS) costs, much research, and increasingly the clinical use of sequence data, is likely to remain focused on the protein-coding exome. We set out to quantify and understand how WGS compares with the targeted capture and sequencing of the exome (exome-seq), for the specific purpose of identifying single nucleotide polymorphisms (SNPs) in exome-targeted regions.

Results: We have compared polymorphism detection sensitivity and systematic biases using a set of tissue samples that have been subjected to both deep exome and whole genome sequencing. Detection sensitivity was scored by downsampling the sequence data and comparing the resulting calls with a set of gold-standard SNP calls for each sample. Despite evidence of incremental improvements in exome capture technology over time, whole genome sequencing has greater uniformity of sequence read coverage and reduced biases in the detection of non-reference alleles than exome-seq. Exome-seq achieves 95% SNP detection sensitivity at a mean on-target depth of 40 reads, whereas WGS requires a mean of only 14 reads. Known disease-causing mutations are not biased towards easy- or hard-to-sequence areas of the genome for either exome-seq or WGS.

Conclusions: From an economic perspective, WGS is at parity with exome-seq for variant detection in the targeted coding regions. WGS offers benefits in uniformity of read coverage and more balanced allele ratio calls, both of which can in most cases be offset by deeper exome-seq, with the caveat that some exome-seq targets will never achieve sufficient mapped read depth for variant detection due to technical difficulties or probe failures. As WGS is intrinsically richer data that can provide insight into polymorphisms outside coding regions and reveal genomic rearrangements, it is likely to progressively replace exome-seq for many applications.
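The sensitivity scoring described in the Results can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the SNP representation, the function names `detection_sensitivity` and `min_depth_for_sensitivity`, and the idea of passing in one call set per downsampled mean depth are all assumptions made for illustration. Sensitivity is taken here to be the fraction of gold-standard SNPs recovered in a downsampled call set.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical SNP key: (chromosome, 1-based position, alternate allele).
Snp = Tuple[str, int, str]


def detection_sensitivity(gold: Iterable[Snp], called: Iterable[Snp]) -> float:
    """Fraction of gold-standard SNPs that also appear in the call set."""
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("gold-standard set is empty")
    return len(gold_set & set(called)) / len(gold_set)


def min_depth_for_sensitivity(
    calls_by_depth: Dict[int, Iterable[Snp]],
    gold: Iterable[Snp],
    target: float = 0.95,
) -> Optional[int]:
    """Smallest downsampled mean depth whose calls reach the target sensitivity.

    Returns None if no supplied depth reaches the target.
    """
    gold_set = set(gold)
    for depth in sorted(calls_by_depth):
        if detection_sensitivity(gold_set, calls_by_depth[depth]) >= target:
            return depth
    return None
```

Given call sets produced at a series of downsampled depths for one sample, `min_depth_for_sensitivity` returns the lowest depth that reaches the chosen threshold, which is the kind of figure the abstract reports for exome-seq and WGS at 95% sensitivity.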
