Fast and sensitive multiple alignment of large genomic sequences

Background

Genomic sequence alignment is a powerful method for genome analysis and annotation, as alignments are routinely used to identify functional sites such as genes or regulatory elements. With a growing number of partially or completely sequenced genomes, multiple alignment is playing an increasingly important role in these studies. In recent years, various tools for pair-wise and multiple genomic alignment have been proposed. Some of them are extremely fast, but often efficiency is achieved at the expense of sensitivity. One way of combining speed and sensitivity is to use an anchored-alignment approach. In a first step, a fast search program identifies a chain of strong local sequence similarities. In a second step, regions between these anchor points are aligned using a slower but more accurate method.

Results

Herein, we present CHAOS, a novel algorithm for rapid identification of chains of local pair-wise sequence similarities. Local alignments calculated by CHAOS are used as anchor points to improve the running time of DIALIGN, a slow but sensitive multiple-alignment tool. We show that this way, the running time of DIALIGN can be reduced by more than 95% for BAC-sized and longer sequences, without affecting the quality of the resulting alignments. We apply our approach to a set of five genomic sequences around the stem-cell-leukemia (SCL) gene and demonstrate that exons and small regulatory elements can be identified by our multiple-alignment procedure.

Conclusion

We conclude that the novel CHAOS local alignment tool is an effective way to significantly speed up global alignment tools such as DIALIGN without reducing the alignment quality. We likewise demonstrate that the DIALIGN/CHAOS combination is able to accurately align short regulatory sequences in distant orthologues.
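The two-step anchored-alignment idea described in the abstract can be illustrated with a minimal sketch. This is not the CHAOS algorithm and not DIALIGN's anchoring interface: it assumes exact k-mer matches as the "local similarities", a simple quadratic-time dynamic program for chaining, and hypothetical function names (`find_seeds`, `chain_seeds`, `gaps_between_anchors`) chosen only for illustration. The regions it returns are the stretches that a slower, more sensitive aligner would then be asked to align.

```python
from collections import defaultdict


def find_seeds(a, b, k):
    """Return sorted (i, j) pairs where a[i:i+k] == b[j:j+k].

    Assumption: exact k-mer matches stand in for local similarities.
    """
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    seeds = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):
            seeds.append((i, j))
    return sorted(seeds)


def chain_seeds(seeds, k):
    """Pick a longest chain of non-overlapping seeds that is increasing
    in both sequences, using a simple O(n^2) dynamic program.

    `seeds` must be sorted, as returned by find_seeds.
    """
    n = len(seeds)
    if n == 0:
        return []
    best = [1] * n   # best[t]: length of the best chain ending at seed t
    prev = [-1] * n  # back-pointer to the previous seed in that chain
    for t in range(n):
        i, j = seeds[t]
        for s in range(t):
            pi, pj = seeds[s]
            # Seed s may precede seed t only if it ends before t starts
            # in both sequences.
            if pi + k <= i and pj + k <= j and best[s] + 1 > best[t]:
                best[t] = best[s] + 1
                prev[t] = s
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(seeds[end])
        end = prev[end]
    return chain[::-1]


def gaps_between_anchors(chain, k, len_a, len_b):
    """Return (a_start, a_end, b_start, b_end) for each non-empty region
    between consecutive anchors, including the two sequence ends.

    These are the regions left for the slower, more accurate aligner.
    """
    regions = []
    pa, pb = 0, 0
    for i, j in chain:
        if i > pa or j > pb:
            regions.append((pa, i, pb, j))
        pa, pb = i + k, j + k
    if len_a > pa or len_b > pb:
        regions.append((pa, len_a, pb, len_b))
    return regions


if __name__ == "__main__":
    a = "AAAACCCCGGGGTTTT"
    b = "AAAATGGGGATTTT"
    anchors = chain_seeds(find_seeds(a, b, 4), 4)
    print(anchors)
    print(gaps_between_anchors(anchors, 4, len(a), len(b)))
```

The speed-up in such a scheme comes from restricting the expensive alignment step to the regions between anchors, which are typically much shorter than the full sequences. A practical anchoring tool would also need to tolerate mismatches within seeds and to score chains rather than just count seeds, which this sketch deliberately omits.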
