A framework for variation discovery and genotyping using next-generation DNA sequencing data

Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.

[1]  P Green,et al.  Base-calling of automated sequencer traces using phred. II. Error probabilities. , 1998, Genome research.

[2]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[3]  P. Green,et al.  Base-calling of automated sequencer traces using phred. I. Accuracy assessment. , 1998, Genome research.

[4]  J. Mullikin,et al.  SSAHA: a fast search method for large DNA databases. , 2001, Genome research.

[5]  Carsten Schwarz,et al.  Genomewide comparison of DNA sequences between humans and chimpanzees. , 2002, American journal of human genetics.

[6]  Jan Freudenberg,et al.  Single nucleotide variation analysis in 65 candidate genes for CNS disorders in a representative sample of the European population. , 2003, Genome research.

[7]  Lei M. Li,et al.  Adjust quality scores from alignment and improve sequencing accuracy. , 2004, Nucleic acids research.

[8]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[9]  J. Lupski,et al.  The complete genome of an individual by massively parallel DNA sequencing , 2008, Nature.

[10]  Juliane C. Dohm,et al.  Substantial biases in ultra-short read data sets from high-throughput DNA sequencing , 2008, Nucleic acids research.

[11]  Nancy F. Hansen,et al.  Accurate Whole Human Genome Sequencing using Reversible Terminator Chemistry , 2008, Nature.

[12]  R. Durbin,et al.  Mapping Quality Scores Mapping Short Dna Sequencing Reads and Calling Variants Using P

, 2022 .

[13]  C. Nusbaum,et al.  Quality scores and SNP detection in sequencing-by-synthesis systems. , 2008, Genome research.

[14]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[15]  Jin Ok Yang,et al.  Mapping Human Genetic Diversity in Asia , 2009, Science.

[16]  Zhaoxia Yu,et al.  Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. , 2009, American journal of human genetics.

[17]  Francisco M. De La Vega,et al.  Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. , 2009, Genome research.

[18]  M. Schatz,et al.  Searching for SNPs with cloud computing , 2009, Genome Biology.

[19]  J. Maguire,et al.  Solution Hybrid Selection with Ultra-long Oligonucleotides for Massively Parallel Targeted Sequencing , 2009, Nature Biotechnology.

[20]  Huanming Yang,et al.  SNP detection for massively parallel whole-genome resequencing. , 2009, Genome research.

[21]  K. Dewar,et al.  A probabilistic approach for SNP discovery in high-throughput human resequencing data. , 2009, Genome research.

[22]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[23]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[24]  Siu-Ming Yiu,et al.  SOAP2: an improved ultrafast tool for short read alignment , 2009, Bioinform..

[25]  Ken Chen,et al.  VarScan: variant detection in massively parallel sequencing of individual and pooled samples , 2009, Bioinform..

[26]  Emily H Turner,et al.  Targeted Capture and Massively Parallel Sequencing of Twelve Human Exomes , 2009, Nature.

[27]  Steven J. M. Jones,et al.  High quality SNP calling using Illumina data at shallow coverage , 2010, Bioinform..

[28]  M. DePristo,et al.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. , 2010, Genome research.

[29]  Asan,et al.  Sequencing of 50 Human Exomes Reveals Adaptation to High Altitude , 2010, Science.

[30]  A. Sparks,et al.  The mutation spectrum revealed by paired genome sequences from a lung cancer patient , 2010, Nature.

[31]  Derek Y. Chiang,et al.  The landscape of somatic copy-number alteration across human cancers , 2010, Nature.

[32]  Philip L. F. Johnson,et al.  A Draft Sequence of the Neandertal Genome , 2010, Science.

[33]  G. Weinstock,et al.  A SNP discovery method to assess variant allele probability from next-generation resequencing data. , 2010, Genome research.

[34]  Robert B. Hartlage,et al.  This PDF file includes: Materials and Methods , 2009 .

[35]  P. Shannon,et al.  Exome sequencing identifies the cause of a Mendelian disorder , 2009, Nature Genetics.

[36]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[37]  P. Shannon,et al.  Analysis of Genetic Inheritance in a Family Quartet by Whole-Genome Sequencing , 2010, Science.

[38]  Tom Royce,et al.  A comprehensive catalogue of somatic mutations from a human cancer genome , 2010, Nature.

[39]  Edwin Cuppen,et al.  Accurate SNP and mutation detection by targeted custom microarray-based genomic enrichment of short-fragment sequencing libraries , 2010, Nucleic acids research.

[40]  Emmanouil Collab A map of human genome variation from population-scale sequencing , 2011, Nature.

[41]  Joshua M. Korn,et al.  Discovery and genotyping of genome structural polymorphism by sequencing on a population scale , 2011, Nature Genetics.

[42]  M. DePristo,et al.  Variation in genome-wide mutation rates within and between human families , 2011, Nature Genetics.