EMAGEN: An Efficient Approach to Multiple Whole Genome Alignment

Following advances in biotechnology, many new whole genome sequences are becoming available every year. A lot of useful information can be derived from the alignment and comparison of different genomes. However, most of the current research focuses on pairwise genome alignment, and only a few available applications can efficiently align multiple genomes. In this paper, we present an efficient approach to align closely related multiple whole genomes, combining suffix arrays, graph theoretic formulation and existing tools for gap (short sequence) alignment. Our approach first finds a maximum set of aligned conserved regions among multiple whole genomes, then aligns the gaps between consecutive conserved regions with Clustal W. We present two methods to find the maximum set of aligned conserved regions among whole genomes. In first method, called Direct Matching (DM), multiple whole genomes are aligned with their DNA sequences. However, because most parts of prokaryotic genomes are encoded regions, we introduce second method, Functional Matching (FM), to especially align multiple prokaryotic genomes with their concatenated protein sequences. We present experimental results for both methods and give the analysis of the results. The FM method generates much better results for less closely related prokaryotic genomes than DM method. It outputs more and longer conserved regions, which conveys more accurate and detailed information about the conservation and inheritance of genomes, and generates more detailed alignments.

[1]  Burkhard Morgenstern,et al.  DIALIGN: finding local similarities by multiple sequence alignment , 1998, Bioinform..

[2]  Burkhard Morgenstern,et al.  DIALIGN2: Improvement of the segment to segment approach to multiple sequence alignment , 1999, German Conference on Bioinformatics.

[3]  M. Crochemore,et al.  On-line construction of suffix trees , 2002 .

[4]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[5]  Michael S. Waterman,et al.  Introduction to computational biology , 1995 .

[6]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[7]  Sean R. Eddy,et al.  Multiple Alignment Using Hidden Markov Models , 1995, ISMB.

[8]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[9]  M. Golumbic Algorithmic graph theory and perfect graphs , 1980 .

[10]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[11]  Enno Ohlebusch,et al.  Efficient multiple genome alignment , 2002, ISMB.

[12]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[13]  Chuong B. Do,et al.  Access the most recent version at doi: 10.1101/gr.926603 References , 2003 .

[14]  S. Salzberg,et al.  Alignment of whole genomes. , 1999, Nucleic acids research.

[15]  Hiroki Arimura,et al.  Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications , 2001, CPM.

[16]  Jeremy P. Spinrad,et al.  Modular decomposition and transitive orientation , 1999, Discret. Math..

[17]  M. Golumbic Algorithmic Graph Theory and Perfect Graphs (Annals of Discrete Mathematics, Vol 57) , 2004 .

[18]  Dan Gusfield Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[19]  Serge A. Hazout,et al.  A strategy for finding regions of similarity in complete genome sequences , 1998, Bioinform..

[20]  Kunihiko Sadakane,et al.  Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array , 2000, ISAAC.

[21]  Stefan Kurtz,et al.  REPuter: fast computation of maximal repeats in complete genomes , 1999, Bioinform..

[22]  Jitender S. Deogun,et al.  On the Vertex Ranking Problem for Trapezoid, Circular-arc and Other Graphs , 1999, Discret. Appl. Math..

[23]  Peter Sanders,et al.  Simple Linear Work Suffix Array Construction , 2003, ICALP.

[24]  Michael Brudno,et al.  Fast and sensitive alignment of large genomic sequences , 2002, Proceedings. IEEE Computer Society Bioinformatics Conference.

[25]  J. Mullikin,et al.  SSAHA: a fast search method for large DNA databases. , 2001, Genome research.

[26]  Enno Ohlebusch,et al.  The Enhanced Suffix Array and Its Applications to Genome Analysis , 2002, WABI.

[27]  Juha Kärkkäinen,et al.  Fast Lightweight Suffix Array Construction and Checking , 2003, CPM.

[28]  Esko Ukkonen,et al.  On{line Construction of Suux Trees 1 , 1995 .

[29]  R. Gibbs,et al.  PipMaker--a web server for aligning two genomic DNA sequences. , 2000, Genome research.

[30]  N. W. Davis,et al.  Genome sequence of enterohaemorrhagic Escherichia coli O157:H7 , 2001, Nature.

[31]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[32]  David C. Schwartz,et al.  erratum Genome sequence of enterohaemorrhagic Escherichia coli 0157:H7 , 2001, Nature.