Building a Pan-Genome Reference for a Population

A reference genome is a high-quality individual genome that serves as a coordinate system for the genomes of a population or of closely related subspecies. Given a set of genomes partitioned by homology into alignment blocks, we formalize the problem of ordering and orienting the blocks so that the result maximally agrees with the order and orientation of the blocks in the underlying genomes, creating a pan-genome reference ordering. We show that this problem is NP-hard, and we demonstrate, both on empirical data and in simulations, the performance of heuristic algorithms that use a cactus graph decomposition to find locally maximal solutions. We describe an extension of our Cactus software that creates a pan-genome reference for whole-genome alignments, and demonstrate how it can be used to create novel genome browser visualizations, using human variation data as a test. In addition, we test the use of a pan-genome for describing variations and as a reference for read mapping.
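
To make the ordering-and-orientation problem concrete, the following toy sketch scores a candidate reference ordering against a set of genomes. It is an illustration under stated assumptions, not the objective function or the heuristic from the paper: blocks are represented as signed integers (a negative sign means reversed orientation), each block is assumed to occur at most once per genome, and the hypothetical `score` function simply counts block pairs whose relative order and orientation a genome shares with the candidate, read on either strand.

```python
from itertools import combinations


def pair_agreements(reference, genome):
    """Count block pairs of `reference` that `genome` agrees with.

    Assumption for this sketch: blocks are signed ints and each block
    appears at most once in a genome.
    """
    # Map block id -> (position in genome, signed block as it occurs there).
    pos = {abs(b): (i, b) for i, b in enumerate(genome)}
    count = 0
    for x, y in combinations(reference, 2):  # x precedes y in the reference
        if abs(x) not in pos or abs(y) not in pos:
            continue  # the genome lacks one of the two blocks
        (i, gx), (j, gy) = pos[abs(x)], pos[abs(y)]
        # Same order and signs when the genome is read on the forward strand.
        forward = gx == x and gy == y and i < j
        # Same order and signs when the genome is read on the reverse strand.
        reverse = gx == -x and gy == -y and i > j
        if forward or reverse:
            count += 1
    return count


def score(reference, genomes):
    """Total pairwise agreement of a candidate ordering with all genomes."""
    return sum(pair_agreements(reference, g) for g in genomes)


# Three toy genomes over blocks 1..3; the third is the reverse strand
# of the first, and the second carries an inversion of block 2.
genomes = [[1, 2, 3], [1, -2, 3], [-3, -2, -1]]
candidate_a = [1, 2, 3]
candidate_b = [1, -2, 3]
print(score(candidate_a, genomes), score(candidate_b, genomes))
```

In this toy setting, the candidate that matches the majority arrangement scores higher than the one that follows the inverted genome. Exhaustively comparing candidates in this way is only feasible for very small inputs, which is consistent with the need for heuristics when the problem is NP-hard.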
