Comprehensive variation discovery in single human genomes

Complete knowledge of the genetic variation in individual human genomes is a crucial foundation for understanding the etiology of disease. Genetic variation is typically characterized by sequencing individual genomes and comparing reads to a reference. Existing methods do an excellent job of detecting variants in approximately 90% of the human genome; however, calling variants in the remaining 10% of the genome (largely low-complexity sequence and segmental duplications) is challenging. To improve variant calling, we developed a new algorithm, DISCOVAR, and examined its performance on improved, low-cost sequence data. Using a newly created reference set of variants from the finished sequence of 103 randomly chosen fosmids, we find that some standard variant call sets miss up to 25% of variants. We show that the combination of new methods and improved data increases sensitivity by several fold, with the greatest impact in challenging regions of the human genome.

[1]  B. Efron Bootstrap Methods: Another Look at the Jackknife , 1979 .

[2]  D. Mccormick Sequence the Human Genome , 1986, Bio/Technology.

[3]  Aravinda Chakravarti,et al.  DNA duplication associated with Charcot-Marie-Tooth disease type 1A , 1991, Cell.

[4]  J. Sutcliffe,et al.  Identification of a gene (FMR-1) containing a CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome , 1991, Cell.

[5]  Eugene W. Myers,et al.  A whole-genome assembly of Drosophila. , 2000, Science.

[6]  P. Pevzner,et al.  An Eulerian path approach to DNA fragment assembly , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[7]  B. Berger,et al.  ARACHNE: a whole-genome shotgun assembler. , 2002, Genome research.

[8]  E. Lander,et al.  Finishing the euchromatic sequence of the human genome , 2004 .

[9]  J. Bonfield,et al.  Finishing the euchromatic sequence of the human genome , 2004, Nature.

[10]  International Human Genome Sequencing Consortium Finishing the euchromatic sequence of the human genome , 2004 .

[11]  E. Eichler,et al.  Shotgun sequence assembly and recent segmental duplications within the human genome , 2004, Nature.

[12]  Huda Y. Zoghbi,et al.  Diseases of Unstable Repeat Expansion: Mechanisms and Common Principles , 2005, Nature Reviews Genetics.

[13]  Alejandro A. Schäffer,et al.  A Fast and Symmetric DUST Implementation to Mask Low-Complexity DNA Sequences , 2006, J. Comput. Biol..

[14]  C. Nusbaum,et al.  ALLPATHS: de novo assembly of whole-genome shotgun microreads. , 2008, Genome research.

[15]  Daniel R. Zerbino,et al.  Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-Read de Novo Assembler , 2009, PloS one.

[16]  B. Birren,et al.  Genome Project Standards in a New Era of Sequencing , 2009, Science.

[17]  Z. Ning,et al.  Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of GC-biased genomes , 2009, Nature Methods.

[18]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[19]  M. DePristo,et al.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. , 2010, Genome research.

[20]  A. Gnirke,et al.  High-quality draft assemblies of mammalian genomes from massively parallel sequence data , 2010, Proceedings of the National Academy of Sciences.

[21]  Dennis C. Friedrich,et al.  A scalable, fully automated process for construction of sequence-ready human exome targeted capture libraries , 2011, Genome Biology.

[22]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[23]  Matthew Berriman,et al.  Iterative Correction of Reference Nucleotides (iCORN) using second generation sequencing technology , 2010, Bioinform..

[24]  Sharon R Grossman,et al.  Integrating common and rare genetic variation in diverse human populations , 2010, Nature.

[25]  Heng Li,et al.  Improving SNP discovery by base alignment quality , 2011, Bioinform..

[26]  M. DePristo,et al.  A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.

[27]  Andrew H. Chan,et al.  ECHO: a reference-free short-read error correction algorithm. , 2011, Genome research.

[28]  Emmanouil Collab A map of human genome variation from population-scale sequencing , 2011, Nature.

[29]  C. Ottenheijm,et al.  Nebulin, a major player in muscle health and disease , 2011, FASEB journal : official publication of the Federation of American Societies for Experimental Biology.

[30]  R. Durbin,et al.  Dindel: accurate indel calls from short-read data. , 2011, Genome research.

[31]  S. Rosset,et al.  lobSTR: A short tandem repeat profiler for personal genomes , 2012, RECOMB.

[32]  N. Lennon,et al.  Characterizing and measuring bias in sequence data , 2013, Genome Biology.

[33]  G. McVean,et al.  De novo assembly and genotyping of variants using colored de Bruijn graphs , 2011, Nature Genetics.

[34]  A. Gnirke,et al.  Paired-end sequencing of Fosmid libraries by Illumina , 2012, Genome research.

[35]  Wei Chen,et al.  A Likelihood-Based Framework for Variant Calling and De Novo Mutation Detection in Families , 2012, PLoS genetics.

[36]  R. Durbin,et al.  Efficient de novo assembly of large genomes using compressed data structures. , 2012, Genome research.

[37]  C. Nusbaum,et al.  Finished bacterial genomes from shotgun sequence data , 2012, Genome research.

[38]  Whitney Wooderchak-Donahue,et al.  A support vector machine for identification of single-nucleotide polymorphisms from next-generation sequencing data , 2013, Bioinform..

[39]  James Lu,et al.  An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data , 2013, Genome research.

[40]  Aaron A. Klammer,et al.  Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data , 2013, Nature Methods.