Mugsy: fast multiple alignment of closely related whole genomes

Motivation: The relative ease and low cost of current-generation sequencing technologies have led to a dramatic increase in the number of sequenced genomes for species across the tree of life. This increasing volume of data requires tools that can quickly compare multiple whole-genome sequences, millions of base pairs in length, to aid in the study of populations, pan-genomes, and genome evolution.

Results: We present a new multiple alignment tool for whole genomes named Mugsy. Mugsy is computationally efficient and can align 31 Streptococcus pneumoniae genomes in less than 2 hours, producing alignments that compare favorably to those of other tools. Mugsy is also the fastest program evaluated for the multiple alignment of assembled human chromosome sequences from four individuals. Mugsy does not require a reference sequence, can align mixtures of assembled draft and completed genome data, and is robust in identifying a rich complement of genetic variation, including duplications, rearrangements, and large-scale gain and loss of sequence.

Availability: Mugsy is free, open-source software available from http://mugsy.sf.net.

Contact: angiuoli@cs.umd.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
