Fast and Accurate Genomic Analyses using Genome Graphs

The human reference genome serves as the foundation for genomics by providing a scaffold for the alignment of sequencing reads, but it currently reflects only a single consensus haplotype, which impairs read alignment and the accuracy of downstream analyses. Reference genome structures that incorporate known genetic variation have been shown to improve the accuracy of genomic analyses, but have so far remained computationally prohibitive for routine large-scale use. Here we present a graph genome implementation that enables read alignment across 2,800 diploid genomes encompassing 12.6 million SNPs and 4.0 million indels. Our Graph Genome Pipeline requires 6.5 hours to process a 30x coverage WGS sample on a system with 36 CPU cores, compared with 11 hours for the GATK Best Practices pipeline. Using complementary benchmarking experiments based on real and simulated data, we show that a graph genome reference improves read mapping sensitivity and produces a 0.5% increase in variant calling recall, or about 20,000 additional variants detected per sample, while leaving variant calling specificity unaffected. Structural variations (SVs) incorporated into a graph genome can be genotyped accurately under a unified framework. Finally, we show that iterative augmentation of graph genomes yields incremental gains in variant calling accuracy. Our implementation is a significant advance towards fulfilling the promise of graph genomes to radically enhance the scalability and accuracy of genomic analyses.
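
To make the reference-bias argument concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical graph encoded as a list of sites, each holding the alleles allowed at that position, and it scores a read by ungapped mismatch count. The names `GRAPH`, `haplotypes`, `linear_reference`, `best_mismatches`, and `best_on_graph` are all illustrative.

```python
from itertools import product

# Hypothetical toy graph: each site lists its allowed alleles.
# One allele = invariant sequence; several alleles = a variant "bubble".
# The first allele at each site is taken to be the linear reference allele.
GRAPH = [["ACGT"], ["A", "G"], ["TTCA"], ["", "CC"], ["GGA"]]


def haplotypes(graph):
    """Enumerate every path through the toy graph as a plain sequence."""
    return ["".join(choice) for choice in product(*graph)]


def linear_reference(graph):
    """Return the single path that takes the first allele at every site."""
    return "".join(site[0] for site in graph)


def best_mismatches(read, sequence):
    """Fewest mismatches over all ungapped placements of read on sequence."""
    n = len(read)
    if n > len(sequence):
        return None  # the read does not fit on this sequence
    return min(
        sum(a != b for a, b in zip(read, sequence[i:i + n]))
        for i in range(len(sequence) - n + 1)
    )


def best_on_graph(read, graph):
    """Fewest mismatches over every path in the graph (brute force)."""
    scores = [best_mismatches(read, h) for h in haplotypes(graph)]
    return min(s for s in scores if s is not None)


# A read carrying the alternate allele "G" at the SNP site.
read = "CGTGTTC"
on_linear = best_mismatches(read, linear_reference(GRAPH))
on_graph = best_on_graph(read, GRAPH)
print(on_linear, on_graph)
```

In this toy case the read is penalised with a mismatch against the linear reference but matches a graph path exactly. Enumerating every path is exponential in the number of variant sites, so this only illustrates the idea; a graph of the scale described in the abstract would need an indexed, path-free alignment strategy.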

[1]  Helga Thorvaldsdóttir,et al.  Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration , 2012, Briefings Bioinform..

[2]  J. Zook,et al.  Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls , 2013, Nature Biotechnology.

[3]  Si Quang Le,et al.  Building Population-Specific Reference Genomes: A Case Study of Vietnamese Reference Genome , 2015, 2015 Seventh International Conference on Knowledge and Systems Engineering (KSE).

[4]  Gonçalo R. Abecasis,et al.  Unified representation of genetic variants , 2015, Bioinform..

[5]  Mauricio O. Carneiro,et al.  Scaling accurate genetic variant discovery to tens of thousands of samples , 2017, bioRxiv.

[6]  Deniz Kural Methods for Inter- and Intra-Species Genomics for the Detection of Variation and Function , 2014 .

[7]  Jérôme Goudet,et al.  Mapping bias overestimates reference allele frequencies at the HLA genes in the 1000 Genomes Project phase I data , 2014 .

[8]  Michael W. Weiner,et al.  Comparison of multi-sample variant calling methods for whole genome sequencing , 2014, 2014 8th International Conference on Systems Biology (ISB).

[9]  Ellen T. Gelfand,et al.  The Genotype-Tissue Expression (GTEx) project , 2013, Nature Genetics.

[10]  G. N. Hannan,et al.  Estimating genotyping error rates from Mendelian errors in SNP array genotypes and their impact on inference. , 2007, Genomics.

[11]  Ronald W. Davis,et al.  Rare variant detection using family-based sequencing analysis , 2013, Proceedings of the National Academy of Sciences.

[12]  Durbin,et al.  Biological Sequence Analysis , 1998 .

[13]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[14]  Whitney Wooderchak-Donahue,et al.  A support vector machine for identification of single-nucleotide polymorphisms from next-generation sequencing data , 2013, Bioinform..

[15]  R. Durbin,et al.  Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly , 2016, bioRxiv.

[16]  Tom R. Gaunt,et al.  Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel , 2015, Nature Communications.

[17]  Deniz Kural,et al.  geck: trio-based comparative benchmarking of variant calls , 2017, bioRxiv.

[18]  Omar E. Cornejo,et al.  Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele Reference Sequence , 2011, PLoS genetics.

[19]  G. Narkis,et al.  Autosomal recessive lethal congenital contractural syndrome type 4 (LCCS4) caused by a mutation in MYBPC1 , 2012, Human mutation.

[20]  Dmitry A. Dmitriev,et al.  Decoding of Superimposed Traces Produced by Direct Sequencing of Heterozygous Indels , 2008, PLoS Comput. Biol..

[21]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[22]  John G. Cleary,et al.  Joint Variant and De Novo Mutation Identification on Pedigrees from High-Throughput Sequencing Data , 2014, bioRxiv.

[23]  Jeanette C Papp,et al.  Detection and integration of genotyping errors in statistical genetics. , 2002, American journal of human genetics.

[24]  F. Collins,et al.  A new initiative on precision medicine. , 2015, The New England journal of medicine.

[25]  Jonathan Sebat,et al.  SV2: Accurate Structural Variation Genotyping and De Novo Mutation Detection from Whole Genomes , 2017, bioRxiv.

[26]  K. Yamamoto,et al.  GLOBAL ALLIANCE FOR GENOMICS AND HEALTH , 2015 .

[27]  Jaana M. Hartikainen,et al.  Large-scale genotyping identifies 41 new loci associated with breast cancer risk , 2013, Nature Genetics.

[28]  D. Mccormick Sequence the Human Genome , 1986, Bio/Technology.

[29]  Esko Ukkonen,et al.  Approximate Boyer-Moore String Matching , 1993, SIAM J. Comput..

[30]  Bradley P. Coe,et al.  Genome structural variation discovery and genotyping , 2011, Nature Reviews Genetics.

[31]  International Human Genome Sequencing Consortium Initial sequencing and analysis of the human genome , 2001, Nature.

[32]  Jinliang Wang,et al.  Sibship reconstruction from genetic data with typing errors. , 2004, Genetics.

[33]  Kari Stefansson,et al.  Graphtyper enables population-scale genotyping using pangenome graphs , 2017, Nature Genetics.

[34]  Ryan E. Mills,et al.  An initial map of insertion and deletion (INDEL) variation in the human genome. , 2006, Genome research.

[35]  Mahmoud Zirie,et al.  The Qatar genome: a population-specific tool for precision medicine in the Middle East , 2016, Human Genome Variation.

[36]  Pieter B. T. Neerincx,et al.  Genome of the Netherlands population-specific imputations identify an ABCA6 variant associated with cholesterol levels , 2015, Nature communications.

[37]  Gabor T. Marth,et al.  An integrated map of structural variation in 2,504 human genomes , 2015, Nature.

[38]  T. Spector,et al.  Parametric model‐based statistics for possible genotyping errors and sample stratification in sibling‐pair SNP data , 2009, Genetic epidemiology.

[39]  Paul Medvedev,et al.  Genome Graphs , 2010 .

[40]  John G. Cleary,et al.  Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines , 2015, bioRxiv.

[41]  Dan Geiger,et al.  Integration of SNP genotyping confidence scores in IBD inference , 2011, Bioinform..

[42]  H. Skaug,et al.  Estimating genotyping error rates from parent–offspring dyads , 2013 .

[43]  O. Birk,et al.  Deciphering the fine-structure of tribal admixture in the Bedouin population using genomic data , 2013, Heredity.

[44]  Michael Boehnke,et al.  Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. , 2002, American journal of human genetics.

[45]  Benedict Paten,et al.  A graph extension of the positional Burrows–Wheeler transform and its applications , 2016, Algorithms for Molecular Biology.

[46]  Pham Bao Son,et al.  AB050. Building population-specific reference genomes: a case study of Vietnamese reference genome. , 2015 .

[47]  Adam M. Novak,et al.  Mapping to a Reference Genome Structure , 2014, 1404.5010.

[48]  L. Jostins Inferring genotyping error rates from genotyped trios , 2011, 1109.1462.

[49]  Wei Chen,et al.  Genotype calling and haplotyping in parent-offspring trios , 2013, Genome research.

[50]  Brian L Browning,et al.  Detecting identity by descent and estimating genotype error rates in sequence data. , 2013, American journal of human genetics.

[51]  Christopher R. Gignoux,et al.  Human demographic history impacts genetic risk prediction across diverse populations , 2016, bioRxiv.

[52]  Kenny Q. Ye,et al.  An integrated map of genetic variation from 1,092 human genomes , 2012, Nature.

[53]  G. Weinstock,et al.  TIGRA: A targeted iterative graph routing assembler for breakpoint assembly , 2014, Genome research.

[54]  Kengo Kinoshita,et al.  Rare variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals , 2015, Nature Communications.

[55]  Robert S. Boyer,et al.  A fast string searching algorithm , 1977, CACM.

[56]  Lars Bolund,et al.  Sequencing and de novo assembly of 150 genomes from Denmark as a population reference , 2017, Nature.

[57]  Gil McVean,et al.  Improved genome inference in the MHC using a population reference graph , 2014, Nature Genetics.

[58]  Gabor T. Marth,et al.  SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications , 2012, PloS one.

[59]  Edwin Cuppen,et al.  Sambamba: fast processing of NGS alignment formats , 2015, Bioinform..

[60]  M. DePristo,et al.  A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.

[61]  Jordan M. Eizenga,et al.  Genome graphs and the evolution of genome inference , 2017, bioRxiv.

[62]  G. McVean,et al.  A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree , 2016, bioRxiv.

[63]  R. Wilson,et al.  Modernizing Reference Genome Assemblies , 2011, PLoS biology.

[64]  Ronald L. Rivest,et al.  Introduction to Algorithms, Second Edition , 2001 .

[65]  Thomas Zichner,et al.  DELLY: structural variant discovery by integrated paired-end and split-read analysis , 2012, Bioinform..

[66]  James Y. Zou Analysis of protein-coding genetic variation in 60,706 humans , 2015, Nature.

[67]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[68]  N. Warthmann,et al.  Simultaneous alignment of short reads against multiple genomes , 2009, Genome Biology.

[69]  Bjarni V. Halldórsson,et al.  Diversity in non-repetitive human sequences not found in the reference genome , 2017, Nature Genetics.

[70]  Lars Feuk,et al.  The Database of Genomic Variants: a curated collection of structural variation in the human genome , 2013, Nucleic Acids Res..

[71]  Lin Huang,et al.  Short read alignment with populations of genomes , 2013, Bioinform..

[72]  Alexa B. R. McIntyre,et al.  Extensive sequencing of seven human genomes to characterize benchmark reference materials , 2015, Scientific Data.

[73]  Pavel A Pevzner,et al.  How to apply de Bruijn graphs to genome assembly. , 2011, Nature biotechnology.

[74]  Udi Manber,et al.  Fast Text Searching With Errors , 2005 .

[75]  Heng Li Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM , 2013, 1303.3997.

[76]  D. Landau,et al.  A Deletion Mutation in TMEM38B Associated with Autosomal Recessive Osteogenesis Imperfecta , 2013, Human mutation.

[77]  Vitor R. C. Aguiar,et al.  Mapping Bias Overestimates Reference Allele Frequencies at the HLA Genes in the 1000 Genomes Project Phase I Data , 2014, G3: Genes, Genomes, Genetics.

[78]  F. Kronenberg,et al.  American Journal of Epidemiology Practice of Epidemiology Estimating the Single Nucleotide Polymorphism Genotype Misclassification from Routine Double Measurements in a Large Epidemiologic Sample , 2022 .

[79]  Gabor T. Marth,et al.  A global reference for human genetic variation , 2015, Nature.

[80]  Joshua M. Stuart,et al.  The Cancer Genome Atlas Pan-Cancer analysis project , 2013, Nature Genetics.

[81]  R. Durbin,et al.  Mapping Quality Scores Mapping Short Dna Sequencing Reads and Calling Variants Using P

, 2022 .

[82]  Jon Bentley,et al.  Programming pearls: algorithm design techniques , 1984, CACM.

[83]  Mauricio O. Carneiro,et al.  From FastQ Data to High‐Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline , 2013, Current protocols in bioinformatics.

[84]  Steven L Salzberg,et al.  HISAT: a fast spliced aligner with low memory requirements , 2015, Nature Methods.

[85]  John C. Marioni,et al.  Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data , 2009, Bioinform..

[86]  Masao Nagasaki,et al.  A statistical variant calling approach from pedigree information and local haplotyping with phase informative reads , 2013, Bioinform..

[87]  Heikki Hyyrö Bit-parallel approximate string matching algorithms with transposition , 2005, J. Discrete Algorithms.

[88]  D. Haydon,et al.  Maximum-Likelihood Estimation of Allelic Dropout and False Allele Error Rates From Microsatellite Genotypes in the Absence of Reference Data , 2007, Genetics.

[89]  R Bellman,et al.  On the Theory of Dynamic Programming. , 1952, Proceedings of the National Academy of Sciences of the United States of America.

[90]  Christian Gieger,et al.  Genome-wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture , 2013, Nature Genetics.

[91]  Shyr Yu,et al.  Genome measures used for quality control are dependent on gene function and ancestry , 2015, Bioinform..

[92]  Yun S. Song,et al.  The Simons Genome Diversity Project: 300 genomes from 142 diverse populations , 2016, Nature.

[93]  Gabor T. Marth,et al.  Haplotype-based variant detection from short-read sequencing , 2012, 1207.3907.

[94]  M. McVey,et al.  MMEJ repair of double-strand breaks (director's cut): deleted sequences and alternative endings. , 2008, Trends in genetics : TIG.