A Unifying Parsimony Model of Genome Evolution

The study of molecular evolution rests on the classical fields of population genetics and systematics, but the increasing availability of DNA sequence data has broadened the field in recent decades, leading to new theories and methodologies. These include parsimony and maximum likelihood methods of phylogenetic tree estimation, the theory of genome rearrangements, and the coalescent model with recombination. All of these interact in the study of genome evolution, yet to date they have been pursued only in isolation. We present the first unified parsimony framework for the study of genome evolutionary histories that includes all of these aspects, proposing a graphical data structure called a history graph that is intended to form a practical basis for analysis. We define tractable upper- and lower-bound parsimony cost functions on history graphs that incorporate both substitutions and rearrangements. We demonstrate that these bounds become tight for a special, unambiguous type of history graph called an ancestral variation graph (AVG), which captures in its combinatorial structure the operations required in an evolutionary history. For an input history graph G, we demonstrate that there exists a finite set of interpretations of G that contains all minimal (lacking extraneous elements) and most parsimonious AVG interpretations of G. We define a partial order over this set and an associated set of sampling moves that can be used to explore these DNA histories. These results generalise and conceptually simplify the problem
