Genome Graphs

There is increasing recognition that a single, monoploid reference genome is a poor universal reference structure for human genetics, because it represents only a tiny fraction of human variation. Adding this missing variation results in a structure that can be described as a mathematical graph: a genome graph. We demonstrate that, in comparison to the existing reference genome (GRCh38), genome graphs can substantially improve the fractions of reads that map uniquely and perfectly. Furthermore, we show that this fundamental simplification of read mapping transforms the variant calling problem from one in which many non-reference variants must be discovered de-novo to one in which the vast majority of variants are simply re-identified within the graph. Using standard benchmarks as well as a novel reference-free evaluation, we show that a simplistic variant calling procedure on a genome graph can already call variants at least as well as, and in many cases better than, a state-of-the-art method on the linear human reference genome. We anticipate that graph-based references will supplant linear references in humans and in other applications where cohorts of sequenced individuals are available.

[1]  Jonathan Kans,et al.  Entrez Direct: E-utilities on the UNIX Command Line , 2016 .

[2]  G. McVean,et al.  A natural encoding of genetic variation in a Burrows-Wheeler Transform to enable mapping and genome inference , 2016, bioRxiv.

[3]  G. McVean,et al.  A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree , 2016, bioRxiv.

[4]  Benedict Paten,et al.  A graph extension of the positional Burrows–Wheeler transform and its applications , 2016, Algorithms for Molecular Biology.

[5]  Pierre Peterlongo,et al.  Read mapping on de Bruijn graphs , 2015, BMC Bioinformatics.

[6]  Gabor T. Marth,et al.  A global reference for human genetic variation , 2015, Nature.

[7]  John G. Cleary,et al.  Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines , 2015, bioRxiv.

[8]  Gonçalo R. Abecasis,et al.  Unified representation of genetic variants , 2015, Bioinform..

[9]  Mark J. P. Chaisson,et al.  Resolving the complexity of the human genome using single-molecule sequencing , 2014, Nature.

[10]  David Haussler,et al.  Canonical, stable, general mapping using context schemes , 2015, Bioinform..

[11]  Vitor R. C. Aguiar,et al.  Mapping Bias Overestimates Reference Allele Frequencies at the HLA Genes in the 1000 Genomes Project Phase I Data , 2014, G3: Genes, Genomes, Genetics.

[12]  Gil McVean,et al.  Improved genome inference in the MHC using a population reference graph , 2014, Nature Genetics.

[13]  G. McVean,et al.  Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications , 2014, Nature Genetics.

[14]  Sergey A. Shiryev,et al.  Single haplotype assembly of the human genome from a hydatidiform mole , 2014, bioRxiv.

[15]  Adam M. Novak,et al.  Mapping to a Reference Genome Structure , 2014, 1404.5010.

[16]  J. Zook,et al.  Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls , 2013, Nature Biotechnology.

[17]  Jacqueline A. L. MacArthur,et al.  Locus Reference Genomic: reference sequences for the reporting of clinically relevant sequence variants , 2013, Nucleic Acids Res..

[18]  Kenny Q. Ye,et al.  An integrated map of genetic variation from 1,092 human genomes , 2012, Nature.

[19]  M. Gerstein,et al.  The GENCODE pseudogene resource , 2012, Genome Biology.

[20]  Bronwen L. Aken,et al.  GENCODE: The reference human genome annotation for The ENCODE Project , 2012, Genome research.

[21]  Gabor T. Marth,et al.  Haplotype-based variant detection from short-read sequencing , 2012, 1207.3907.

[22]  ENCODEConsortium,et al.  An Integrated Encyclopedia of DNA Elements in the Human Genome , 2012, Nature.

[23]  Juliane C. Dohm,et al.  Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems , 2011, Genome Biology.

[24]  R. Wilson,et al.  Modernizing Reference Genome Assemblies , 2011, PLoS biology.

[25]  David Haussler,et al.  Cactus: Algorithms for genome multiple sequence alignment. , 2011, Genome research.

[26]  Ryan E. Mills,et al.  Natural genetic variation caused by small insertions and deletions in the human genome. , 2011, Genome research.

[27]  M. DePristo,et al.  A framework for variation discovery and genotyping using next-generation DNA sequencing data , 2011, Nature Genetics.

[28]  Carl Kingsford,et al.  A fast, lock-free approach for efficient parallel counting of occurrences of k-mers , 2011, Bioinform..

[29]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[30]  M. DePristo,et al.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. , 2010, Genome research.

[31]  Jared T. Simpson,et al.  Efficient construction of an assembly string graph using the FM-index , 2010, Bioinform..

[32]  John C. Marioni,et al.  Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data , 2009, Bioinform..

[33]  Paul Medvedev,et al.  Maximum Likelihood Genome Assembly , 2009, J. Comput. Biol..

[34]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[35]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[36]  James G. R. Gilbert,et al.  Variation analysis and gene annotation of eight MHC haplotypes: The MHC Haplotype Project , 2008, Immunogenetics.

[37]  Eugene W. Myers,et al.  The fragment assembly string graph , 2005, ECCB/JBI.

[38]  Haixu Tang,et al.  Splicing graphs and EST assembly problem , 2002, ISMB.

[39]  W. J. Kent,et al.  BLAT--the BLAST-like alignment tool. , 2002, Genome research.

[40]  P. Pevzner,et al.  An Eulerian path approach to DNA fragment assembly , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[41]  International Human Genome Sequencing Consortium Initial sequencing and analysis of the human genome , 2001, Nature.