A Theoretical Model for Whole Genome Alignment

We present a graph-based model for representing two aligned genomic sequences. An alignment graph is a mixed graph consisting of two sets of vertices, each representing one of the input sequences, and three sets of edges. These edges allow the model to represent a number of evolutionary events, and the model is used to perform sequence alignment at the level of nucleotides. We define a scoring function for alignment graphs and show that minimizing the score is NP-complete. However, we present a dynamic programming algorithm that solves the minimization problem optimally for a certain class of alignments, called breakable arrangements: it optimally aligns two genomic sequences when one of the input sequences is a breakable arrangement of the other. We also present algorithms for analyzing breakable arrangements, and a greedy algorithm that can represent reversals in addition to rearrangements, mutations, and other evolutionary events. Comparing breakable arrangements with alignments generated by other algorithms shows that many already aligned genomes are breakable.
