Cactus: Algorithms for genome multiple sequence alignment.

Much attention has been given to the problem of creating reliable multiple sequence alignments in a model incorporating substitutions, insertions, and deletions. Far less attention has been paid to the problem of optimizing alignments in the presence of more general rearrangement and copy number variation. Using Cactus graphs, recently introduced for representing sequence alignments, we describe two complementary algorithms for creating genomic alignments. We have implemented these algorithms in the new "Cactus" alignment program. We test Cactus using the Evolver genome evolution simulator, a comprehensive new simulation tool, and show, using both Evolver and existing simulations, that Cactus significantly outperforms all of its peers. Finally, we empirically assess Cactus's ability to align genes correctly, and we find interesting cases of intra-gene duplication within the primates.
