Genome graphs and the evolution of genome inference

The human reference genome is part of the foundation of modern human biology and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph-based models. Here, we survey various projects underway to build and apply these graph-based structures, which we collectively refer to as genome graphs, and discuss the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.
