progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement

Background: Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms.

Methodology/Principal Findings: We describe a new method to align two or more genomes that have undergone rearrangements due to recombination and substantial amounts of segmental gain and loss (flux). We demonstrate that the new method can accurately align regions conserved in some, but not all, of the genomes, an important case not handled by our previous work. The method uses a novel alignment objective score called a sum-of-pairs breakpoint score, which facilitates accurate detection of rearrangement breakpoints when genomes have unequal gene content. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The new genome alignment algorithm demonstrates high accuracy in situations where genomes have undergone biologically feasible amounts of genome rearrangement, segmental gain and loss. We apply the new algorithm to a set of 23 genomes from the genera Escherichia, Shigella, and Salmonella. Analysis of whole-genome multiple alignments allows us to extend the previously defined concepts of core- and pan-genomes to include not only annotated genes, but also non-coding regions with potential regulatory roles. The 23 enterobacteria have an estimated core-genome of 2.46 Mbp conserved among all taxa and a pan-genome of 15.2 Mbp. We document substantial population-level variability among these organisms driven by segmental gain and loss. Interestingly, much variability lies in intergenic regions, suggesting that the Enterobacteriaceae may exhibit regulatory divergence.

Conclusions: The multiple genome alignments generated by our software provide a platform for comparative genomic and population genomic studies. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve.
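The core- and pan-genome sizes reported above can be illustrated with a minimal sketch. It assumes a whole-genome alignment has already been reduced to a list of blocks, each recording a length in base pairs and the set of genomes with sequence in that block. The `Block` type and `core_and_pan_size` function are hypothetical names introduced here for illustration. They are not part of the Mauve software, and the sketch is not the estimation procedure used in the paper. Under these assumptions, the core-genome is the total length of blocks present in every genome, and the pan-genome is the total length of all blocks.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical alignment block: a segment and the genomes that contain it."""
    length: int              # segment length in bp (simplification: gaps ignored)
    genomes: FrozenSet[str]  # names of genomes with sequence in this block


def core_and_pan_size(blocks: Iterable[Block],
                      all_genomes: Iterable[str]) -> Tuple[int, int]:
    """Return (core_bp, pan_bp) under the simplified block model above."""
    everyone = frozenset(all_genomes)
    core_bp = 0
    pan_bp = 0
    for block in blocks:
        # Every block counts once toward the pan-genome.
        pan_bp += block.length
        # Only blocks shared by all genomes count toward the core-genome.
        if block.genomes >= everyone:
            core_bp += block.length
    return core_bp, pan_bp


# Toy data, not taken from the paper.
toy_blocks = [
    Block(100, frozenset({"A", "B", "C"})),  # conserved in all three genomes
    Block(50, frozenset({"A", "B"})),        # present in a subset
    Block(30, frozenset({"C"})),             # lineage-specific segment
]
core_bp, pan_bp = core_and_pan_size(toy_blocks, ["A", "B", "C"])
print(core_bp, pan_bp)
```

Counting blocks rather than annotated genes reflects the point made in the abstract: intergenic segments contribute to both totals, so non-coding regions are included in the core- and pan-genome.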

[1]  K. Rudd,et al.  Integration host factor binds to a unique class of complex repetitive extragenic DNA sequences in Escherichia coli , 1993, Molecular microbiology.

[2]  Kun-Mao Chao,et al.  A local alignment tool for very long DNA sequences , 1995, Comput. Appl. Biosci..

[3]  Pavel A. Pevzner,et al.  Transforming men into mice (polynomial algorithm for genomic distance problem) , 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.

[4]  Blanchette,et al.  Breakpoint Phylogenies. , 1997, Genome informatics. Workshop on Genome Informatics.

[5]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[6]  S. Salzberg,et al.  Alignment of whole genomes. , 1999, Nucleic acids research.

[7]  G. Pupo,et al.  Multiple independent origins of Shigella clones of Escherichia coli and convergent evolution of many of their characteristics. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Lior Pachter,et al.  VISTA : visualizing global DNA sequence alignments of arbitrary length , 2000, Bioinform..

[9]  W. Fitch Homology a personal view on some of the problems. , 2000, Trends in genetics : TIG.

[10]  N. W. Davis,et al.  Genome sequence of enterohaemorrhagic Escherichia coli O157:H7 , 2001, Nature.

[11]  J. Kadane,et al.  Bayesian phylogenetic inference from animal mitochondrial genome arrangements , 2002 .

[12]  Enno Ohlebusch,et al.  Efficient multiple genome alignment , 2002, ISMB.

[13]  F. Blattner,et al.  Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[14]  Bin Ma,et al.  PatternHunter: faster and more sensitive homology search , 2002, Bioinform..

[15]  Aleksey Y. Ogurtsov,et al.  OWEN: aligning long collinear regions of genomes , 2002, Bioinform..

[16]  Francesca Chiaromonte,et al.  Scoring Pairwise Genomic Sequence Alignments , 2001, Pacific Symposium on Biocomputing.

[17]  S. Salzberg,et al.  Fast algorithms for large-scale genome alignment and comparison. , 2002, Nucleic acids research.

[18]  D. Haussler,et al.  Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[19]  Inna Dubchak,et al.  Glocal alignment: finding rearrangements during alignment , 2003, ISMB.

[20]  S. Salzberg,et al.  Versatile and open software for comparing large genomes , 2004, Genome Biology.

[21]  Chuong B. Do,et al.  Access the most recent version at doi: 10.1101/gr.926603 References , 2003 .

[22]  E. Rocha,et al.  Associations between inverted repeats and the structural evolution of bacterial genomes. , 2003, Genetics.

[23]  Lior Pachter,et al.  MAVID multiple alignment server , 2003, Nucleic Acids Res..

[24]  C. Stoeckert,et al.  OrthoMCL: identification of ortholog groups for eukaryotic genomes. , 2003, Genome research.

[25]  T. Speed,et al.  Biological Sequence Analysis , 1998 .

[26]  Jijun Tang,et al.  Scaling up accurate phylogenetic reconstruction from gene-order data , 2003, ISMB.

[27]  F. Blattner,et al.  Mauve: multiple alignment of conserved genomic sequence with rearrangements. , 2004, Genome research.

[28]  Aaron E. Darling,et al.  GRIL: genome rearrangement and inversion locator , 2004, Bioinform..

[29]  Sorin Istrail,et al.  Finding anchors for genomic sequence comparison , 2004, RECOMB.

[30]  Michael Brudno,et al.  The CHAOS/DIALIGN WWW server for multiple alignment of genomic sequences , 2004, Nucleic Acids Res..

[31]  Marie-France Sagot,et al.  Sorting by Reversals in Subquadratic Time , 2004, CPM.

[32]  Benjamin J. Raphael,et al.  A novel method for multiple alignment of sequences with repeated and shuffled elements. , 2004, Genome research.

[33]  Robert C. Edgar,et al.  MUSCLE: a multiple sequence alignment method with reduced time and space complexity , 2004, BMC Bioinformatics.

[34]  D. Haussler,et al.  Aligning multiple genomic sequences with the threaded blockset aligner. , 2004, Genome research.

[35]  Philip E. Bourne,et al.  Proceedings of the eighth annual international conference on Research in computational molecular biology , 2004, RECOMB 2004.

[36]  J. Glasner,et al.  Genome-wide detection and analysis of homologous recombination among sequenced strains of Escherichia coli , 2006, Genome Biology.

[37]  Korine S. E. Ung,et al.  Evidence of a Large Novel Gene Pool Associated with Prokaryotic Genomic Islands , 2005, PLoS genetics.

[38]  Chuong B. Do,et al.  ProbCons: Probabilistic consistency-based multiple sequence alignment. , 2005, Genome research.

[39]  Jaideep P. Sundaram,et al.  Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[40]  Richard Friedberg,et al.  Efficient sorting of genomic permutations by translocation, inversion and block interchange , 2005, Bioinform..

[41]  Yu Zhang,et al.  An Eulerian path approach to local multiple alignment for DNA sequences. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[42]  W. Miller,et al.  Mulan: multiple-sequence local alignment and visualization for studying function and evolution. , 2005, Genome research.

[43]  G. Kucherov,et al.  Multiseed lossless filtration , 2009, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[44]  Tu Minh Phuong,et al.  Multiple alignment of protein sequences with repeats and rearrangements , 2006, Nucleic acids research.

[45]  Jun Yu,et al.  Revisiting the Molecular Evolutionary History of Shigella spp. , 2006, Journal of Molecular Evolution.

[46]  Ron Y. Pinter,et al.  An Integrative Method for Accurate Comparative Genome Mapping , 2006, PLoS Comput. Biol..

[47]  Jens Stoye,et al.  A Unifying View of Genome Rearrangements , 2006, WABI.

[48]  Xavier Messeguer,et al.  Procrastination Leads to Efficient Filtration for Local Multiple Alignment , 2006, WABI.

[49]  Colin N. Dewey,et al.  Evolution at the nucleotide level: the problem of multiple whole-genome alignment. , 2006, Human molecular genetics.

[50]  Xavier Messeguer,et al.  M-GCAT: interactively and efficiently constructing large-scale multiple genome comparison frameworks in closely related species , 2006, BMC Bioinformatics.

[51]  Ron Y. Pinter,et al.  On the Repeat-Annotated Phylogenetic Tree Reconstruction Problem , 2006, J. Comput. Biol..

[52]  Georgios S. Vernikos,et al.  Genetic flux over time in the Salmonella lineage , 2007, Genome Biology.

[53]  Le Sy Vinh,et al.  Pairwise alignment with rearrangements. , 2006, Genome informatics. International Conference on Genome Informatics.

[54]  A. Prakash,et al.  Measuring the accuracy of genome-size multiple alignments , 2007, Genome Biology.

[55]  Sudhir Kumar,et al.  Multiple sequence alignment: in pursuit of homologous DNA positions. , 2007, Genome research.

[56]  Justin S. Hogg,et al.  Characterization and modeling of the Haemophilus influenzae core and supragenomes based on the complete genomic sequences of Rd and 12 clinical nontypeable strains , 2007, Genome Biology.

[57]  Daniel J. Blankenberg,et al.  28-way vertebrate alignment and conservation track in the UCSC Genome Browser. , 2007, Genome research.

[58]  D. Falush,et al.  Inference of Bacterial Microevolution Using Multilocus Sequence Data , 2007, Genetics.

[59]  Colin N. Dewey,et al.  Aligning multiple whole genomes with Mercator and MAVID. , 2007, Methods in molecular biology.

[60]  Tao Jiang,et al.  MSOAR: A High-Throughput Ortholog Assignment System Based on Genome Rearrangement , 2007, J. Comput. Biol..

[61]  Colin N. Dewey,et al.  Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. , 2007, Genome research.

[62]  Gerton Lunter,et al.  Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes , 2007, ISMB/ECCB.

[63]  J. Roth,et al.  Ohno's dilemma: Evolution of new genes under continuous selection , 2007, Proceedings of the National Academy of Sciences.

[64]  Max A. Alekseyev,et al.  Multi-Break Rearrangements and Breakpoint Re-Uses: From Circular to Linear Genomes , 2008, J. Comput. Biol..

[65]  Alexandre Z. Caldeira,et al.  Uncertainty in homology inferences: assessing and improving genomic sequence alignment. , 2008, Genome research.

[66]  E. Birney,et al.  Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs. , 2008, Genome research.

[67]  David Haussler,et al.  The infinite sites model of genome evolution , 2008, Proceedings of the National Academy of Sciences.

[68]  I. Miklós,et al.  Dynamics of Genome Rearrangement in Bacterial Populations , 2008, PLoS genetics.

[69]  P. Gajer,et al.  The Pangenome Structure of Escherichia coli: Comparative Genomic Analysis of E. coli Commensal and Pathogenic Isolates , 2008, Journal of bacteriology.

[70]  Inna Dubchak,et al.  Multiple whole-genome alignments without a reference organism. , 2009, Genome research.

[71]  Xavier Messeguer,et al.  A Novel Heuristic for Local Multiple Alignment of Interspersed DNA Repeats , 2009, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[72]  D. Falush,et al.  Inferring genomic flux in bacteria. , 2009, Genome research.

[73]  J. Lagergren,et al.  Simultaneous Bayesian gene tree reconstruction and reconciliation analysis , 2009, Proceedings of the National Academy of Sciences.

[74]  Lior Pachter,et al.  Fast Statistical Alignment , 2009, PLoS Comput. Biol..

[75]  Jonathan A. Eisen,et al.  BioTorrents: A File Sharing Service for Scientific Data , 2010, PloS one.