Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications

High-throughput DNA sequencing technology has transformed genetic research and is starting to make an impact on clinical practice. However, analyzing high-throughput sequencing data remains challenging, particularly in clinical settings where accuracy and turnaround times are critical. We present a new approach to this problem, implemented in a software package called Platypus. Platypus achieves high sensitivity and specificity for SNPs, indels and complex polymorphisms by using local de novo assembly to generate candidate variants, followed by local realignment and probabilistic haplotype estimation. It is an order of magnitude faster than existing tools and generates calls from raw aligned read data without preprocessing. We demonstrate the performance of Platypus in clinically relevant experimental designs by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls.
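The candidate-generation step described above (local assembly proposing variant haplotypes that are then realigned and scored) can be illustrated with a toy de Bruijn graph. This is a minimal sketch under stated assumptions, not Platypus's actual implementation: the function names `build_graph` and `candidate_haplotypes`, the k-mer size, the length bound, and the example sequences are all hypothetical, and no filtering of sequencing errors or probabilistic scoring is attempted.

```python
from collections import defaultdict


def build_graph(sequences, k):
    """Build a de Bruijn graph: each (k-1)-mer maps to the (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def candidate_haplotypes(reference, reads, k=4, extra_len=10):
    """Return non-reference sequences spelled by paths through the window.

    Paths start at the first (k-1)-mer of the reference window and stop at
    its last (k-1)-mer. Path length is capped (an illustrative choice) so
    that cycles in the graph cannot cause unbounded growth.
    """
    graph = build_graph([reference] + list(reads), k)
    start, end = reference[:k - 1], reference[-(k - 1):]
    max_len = len(reference) + extra_len

    found = set()
    stack = [(start, start)]  # (current node, sequence spelled so far)
    while stack:
        node, seq = stack.pop()
        if node == end:
            found.add(seq)
            continue
        if len(seq) >= max_len:
            continue
        for nxt in graph.get(node, ()):
            stack.append((nxt, seq + nxt[-1]))

    found.discard(reference)
    return sorted(found)


# Hypothetical window: two reads carry a T>A substitution at position 5.
reference = "ACGTTGCAGT"
reads = ["CGTAGCA", "GTAGCAG"]
print(candidate_haplotypes(reference, reads, k=4))  # ['ACGTAGCAGT']
```

In a real caller each candidate haplotype would next be compared against the reads in a likelihood model; here the sketch stops at enumeration, which is the part the assembly step contributes.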
