Building a Pangenome Reference for a Population

A reference genome is a high-quality individual genome that serves as a coordinate system for the genomes of a population or of closely related subspecies. Given a set of genomes partitioned by homology into alignment blocks, we formalise the problem of ordering and orienting the blocks so that the result maximally agrees with the order and orientation of the blocks in the underlying genomes, creating a pangenome reference ordering. We show that this problem is NP-hard, and demonstrate, both empirically and in simulations, the performance of heuristic algorithms based on a cactus graph decomposition that find locally maximal solutions. We describe an extension of our Cactus software that creates a pangenome reference for whole-genome alignments, and demonstrate how it can be used to create novel genome browser visualizations, using human variation data as a test.
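To make the ordering-and-orienting problem concrete, the following is a minimal Python sketch under stated assumptions. It is not the formal objective or the cactus-based heuristic from the paper. It assumes each genome is a list of signed block identifiers (a negative sign meaning reverse orientation), that each block occurs at most once per genome, and it uses a simplified pairwise score: a pair of blocks in a genome counts as agreeing if the reference has them in the same relative order and orientation, read on either strand. The names `agreement` and `best_reference` are hypothetical.

```python
from itertools import combinations, permutations, product


def agreement(reference, genomes):
    """Count block pairs in the genomes whose relative order and
    orientation match the reference (illustrative objective only)."""
    pos = {abs(b): i for i, b in enumerate(reference)}
    sign = {abs(b): (1 if b > 0 else -1) for b in reference}
    score = 0
    for genome in genomes:
        # combinations() keeps input order, so a precedes b in the genome.
        for a, b in combinations(genome, 2):
            ia, ib = abs(a), abs(b)
            if ia not in pos or ib not in pos:
                continue
            sa = 1 if a > 0 else -1
            sb = 1 if b > 0 else -1
            # Same strand: same order, same orientations.
            forward = pos[ia] < pos[ib] and sign[ia] == sa and sign[ib] == sb
            # Opposite strand: reversed order, flipped orientations.
            reverse = pos[ib] < pos[ia] and sign[ia] == -sa and sign[ib] == -sb
            if forward or reverse:
                score += 1
    return score


def best_reference(genomes):
    """Exhaustive search over all orders and orientations.
    Exponential in the number of blocks, so usable only on toy inputs."""
    blocks = sorted({abs(b) for g in genomes for b in g})
    best, best_score = None, -1
    for order in permutations(blocks):
        for signs in product((1, -1), repeat=len(blocks)):
            candidate = [s * b for s, b in zip(signs, order)]
            score = agreement(candidate, genomes)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


# Three toy genomes; the second carries an inversion of block 2.
genomes = [[1, 2, 3], [1, -2, 3], [1, 2, 3]]
print(best_reference(genomes))  # ([1, 2, 3], 7)
```

The exhaustive search enumerates every permutation and sign assignment, which is feasible only for a handful of blocks. That cost is consistent with the NP-hardness result and is the motivation for heuristics that settle for locally maximal solutions.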
