Murasaki: A Fast, Parallelizable Algorithm to Find Anchors from Multiple Genomes

Background With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved by applying pairwise alignment tools such as BLASTZ through progressive alignment tools such as TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grow.
Methodology/Principal Findings Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy.
Conclusions/Significance Murasaki provides an open source platform that takes advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than that of existing methods. Murasaki is available under the GPL at http://murasaki.sourceforge.net.
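
As an illustration of the spaced-seed idea mentioned in the abstract, the following minimal Python sketch indexes every window of each sequence by the bases at the "1" positions of a mismatch pattern and reports keys that occur in all sequences as candidate anchors. It is not Murasaki's implementation: the function names `seed_key` and `find_anchors`, the pattern `1101`, and the toy sequences are assumptions made for this example, and a plain dictionary stands in for the adaptive hash functions the abstract describes.

```python
from collections import defaultdict


def seed_key(seq, pos, pattern):
    """Concatenate the bases under the '1' positions of the pattern.

    Positions marked '0' are "don't care" and may mismatch.
    """
    return "".join(seq[pos + i] for i, c in enumerate(pattern) if c == "1")


def find_anchors(seqs, pattern):
    """Return {key: [positions in seq 0, positions in seq 1, ...]}
    for every seed key that occurs in all input sequences.

    Illustrative only: a dictionary keyed by the seed string replaces
    the hash table and adaptive hash function of the real tool.
    """
    index = defaultdict(lambda: [[] for _ in seqs])
    for s_idx, seq in enumerate(seqs):
        # Slide the pattern over every full-length window.
        for pos in range(len(seq) - len(pattern) + 1):
            index[seed_key(seq, pos, pattern)][s_idx].append(pos)
    # Keep only keys with at least one hit in every sequence.
    return {k: v for k, v in index.items() if all(v)}


if __name__ == "__main__":
    # The windows ACGT and ACCT differ only at the '0' position.
    print(find_anchors(["GGACGTGG", "TTACCTAA"], "1101"))
    # prints {'ACT': [[2], [2]]}
```

In this toy run the two sequences share no exact 4-mer, yet the pattern `1101` still pairs position 2 of each sequence because the single mismatch falls on the ignored position.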
