Pangenome-based genome inference

Typical analysis workflows map reads to a reference genome in order to detect genetic variants. Generating such alignments introduces reference biases, in particular against insertion alleles absent from the reference, and comes with a substantial computational burden. In contrast, recent k-mer-based genotyping methods are fast but struggle in repetitive or duplicated regions of the genome. We propose a novel algorithm, called PanGenie, that leverages a pangenome reference built from haplotype-resolved genome assemblies in conjunction with k-mer count information from raw, short-read sequencing data to genotype a wide spectrum of genetic variation. The given haplotypes enable our method to take advantage of linkage information to aid genotyping in regions poorly covered by unique k-mers and provide access to regions otherwise inaccessible by short reads. Compared to classic mapping-based approaches, our approach is more than 4× faster at 30× coverage and, at the same time, reaches significantly better genotype concordances for almost all variant types and coverages tested. Improvements are especially pronounced for large insertions (> 50 bp): we are able to genotype > 99.9% of all tested variants with over 90% accuracy at 30× short-read coverage, whereas the best competing tools either typed less than 60% of variants or reached accuracies below 70%. PanGenie now enables the inclusion of this commonly neglected variant type in downstream analyses.
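
To make the idea of genotyping from k-mer counts concrete, the following is a minimal toy sketch, not PanGenie's implementation. It assumes a single biallelic variant, a known per-haplotype k-mer coverage, and a simple Poisson count model. The function names (`count_kmers`, `allele_specific_kmers`, `genotype`) and the parameters `hap_cov` and `err` are hypothetical and chosen for illustration only. The sketch ignores reverse complements and does not use the haplotype panel or linkage information that the abstract describes for regions with few unique k-mers.

```python
import math
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the given read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def allele_specific_kmers(ref_allele, alt_allele, k):
    """Return (ref_only, alt_only): k-mers found in exactly one allele.

    Both allele strings are expected to include flanking sequence.
    """
    ref = {ref_allele[i:i + k] for i in range(len(ref_allele) - k + 1)}
    alt = {alt_allele[i:i + k] for i in range(len(alt_allele) - k + 1)}
    return ref - alt, alt - ref


def poisson_logpmf(count, mean):
    """Log probability of observing `count` under Poisson(mean)."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)


def genotype(read_counts, ref_kmers, alt_kmers, hap_cov, err=0.05):
    """Return the most likely number of alt copies (0, 1 or 2).

    hap_cov: assumed expected k-mer count per haplotype copy.
    err:     assumed small expected count for an absent k-mer
             (sequencing errors); keeps the Poisson mean positive.
    """
    best_copies, best_ll = None, -math.inf
    for alt_copies in (0, 1, 2):
        ref_mean = max((2 - alt_copies) * hap_cov, err)
        alt_mean = max(alt_copies * hap_cov, err)
        ll = sum(poisson_logpmf(read_counts[km], ref_mean) for km in ref_kmers)
        ll += sum(poisson_logpmf(read_counts[km], alt_mean) for km in alt_kmers)
        if ll > best_ll:
            best_copies, best_ll = alt_copies, ll
    return best_copies
```

In this toy model, a variant whose reference-specific and alternative-specific k-mers are both observed at roughly the per-haplotype coverage is called heterozygous. When no allele-specific k-mers exist, as in repetitive regions, every genotype receives the same likelihood, which illustrates why counts alone are insufficient there.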
